Abstract

Rough terrain autonomous navigation continues to pose a challenge to the
robotics community. Robust navigation by a mobile robot depends not only on
the individual performance of perception and planning systems, but also on how
well these systems are coupled. When traversing complex unstructured terrain,
this coupling (in the form of a cost function) has a large impact on robot
behavior and performance, necessitating a robust design. This paper explores
the application of Learning from Demonstration to this task for the Crusher
autonomous navigation platform. Using expert examples of desired navigation
behavior, mappings from both online and offline perceptual data to planning
costs are learned. Challenges in adapting existing techniques to complex
online planning systems and imperfect demonstration are addressed, along with
additional practical considerations. The benefits to autonomous performance of
this approach are examined, as well as the decrease in necessary designer
effort. Experimental results are presented from autonomous traverses through
complex natural environments.

1  Introduction 

The capability of autonomous robotic systems to successfully navigate through
difficult environments continues to advance [Kelly et al., 2006, Buehler,
2006]. Ever improving high resolution sensors and perception algorithms allow
a mobile robot to build a detailed model of its environment, and advances in
planning systems allow for the generation of ever more complex routes and
trajectories towards achieving a navigation goal. However, as perception and
planning systems become more complex, so does the task of coupling these
systems.

A common division of responsibility in mobile robotics tasks a perception
system with constructing a discrete model of the environment. This model is
then interpreted into an appropriate form for use by a planning system, which
determines the robot’s actions. In the simplest case, this interpretation
determines which locations in the environment model the robot can and cannot
traverse through (i.e. an obstacle versus freespace or traversable versus
non-traversable distinction). A planning system then computes a path through
traversable regions of the environment. In this way, perception and planning
are coupled through a binary classification of the environment.

While such a binary approach was utilized in the early days of outdoor
autonomous navigation [Olin and Tseng, 1991], more recent work of the past
decade has shown it to be insufficient for navigating complex unstructured
environments. Rough natural terrain is not easily partitioned into clear
traversable and non-traversable classes; while certain objects may be obvious
non-traversable obstacles, unstructured environments generally present a
continuum of terrain that could fall into the traversable category, such as
steep slopes, ditches, smaller (surmountable) objects, and widely varying
vegetation (Figure 1). A binary interpretation makes no distinction between
these different terrain features, resulting in behavior that is either overly
aggressive or overly conservative (Figure 2).

As these issues have become better understood, the result has been a move to
systems that use a continuous coupling between perception and planning
[Kelly et al., 2006, Lacaze et al., 2002, Stentz et al., 2007,
Singh et al., 2000, Urmson et al., 2006, Biesiadecki and Maimone, 2006]. Such
a coupling is commonly called a Cost Function. A cost function maps terrain
features produced by perception to a scalar cost value, with lower cost
terrain being preferable to higher cost terrain. A planning system then
computes a trajectory that minimizes the accrued cost of traversed terrain.
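As a rough illustration of this coupling, the following minimal sketch pairs a lookup-table cost function with a Dijkstra planner on a tiny grid. The grid, the terrain symbols, the cost values, and the names `plan` and `make_cost` are all assumptions made for this example; they are not taken from the Crusher system. The sketch only shows that changing the relative cost of one intermediate terrain class (here, bushes) changes which path is optimal.

```python
import heapq

def plan(grid, cost_of, start, goal):
    """Dijkstra over a 4-connected grid; cost is accrued on entering a cell."""
    best = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    while frontier:
        dist, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if dist > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
                new_dist = dist + cost_of(grid[nr][nc])
                if new_dist < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_dist
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (new_dist, (nr, nc)))
    # Walk parents back from the goal (assumes the goal is reachable).
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], best[goal]

# Hypothetical terrain map: '.' is freespace, 'b' is bushes.
GRID = [".....",
        ".bbb."]
START, GOAL = (1, 0), (1, 4)

def make_cost(bush_cost):
    """Hypothetical cost function: terrain class -> scalar cost."""
    table = {".": 1.0, "b": bush_cost}
    return lambda terrain: table[terrain]

cheap_path, cheap_total = plan(GRID, make_cost(1.2), START, GOAL)
costly_path, costly_total = plan(GRID, make_cost(4.0), START, GOAL)
print("bush cost 1.2:", cheap_path, cheap_total)
print("bush cost 4.0:", costly_path, costly_total)
```

With the lower bush cost the planner drives straight through the bushes; with the higher bush cost it detours through freespace. The planner itself is unchanged between the two runs, so the difference in behavior comes entirely from the cost function.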

While using a continuous definition of cost allows for more intelligent
behavior, it also increases the complexity of deploying a properly functioning
robot. Systems that used a traversable or non-traversable distinction only had
to solve a binary classification problem. This mapping completely determined
the behavior of the robot (with respect to where it preferred to drive);
essentially, the desired behavior of the robot was encoded in the
classification function mapping perceptual data to traversable or
non-traversable. However, in a system with continuous costs, in essence a full
regression problem must be solved. That is, the desired behavior of a robot is
encoded not in a mapping from terrain properties to a binary class, but in a
mapping from terrain properties to a scalar value that encodes preferences
(see Figure 2).

This mapping from terrain properties to cost is far more complex than a binary
mapping, as it encodes far more complex behavior (through continuous output).
Worse, the preferences amongst terrain types are not always well understood,
because the metric for performance is rarely concretely defined. Common
metrics for autonomous behavior include maximizing safety or probability of
success, minimizing distance traveled or time taken, minimizing net energy
loss, minimizing observability, or maximizing sensor coverage. However, often
the actual desired robot behavior optimizes a combination of such metrics; for
example, it may be desirable for a robot to approximately maximize safety but
take certain risks to minimize distance traveled. Encoding a single metric
into a cost function is sufficiently difficult, but properly inferring the
correct weighting to construct a multi-criterion optimization problem is even
more daunting. Fundamentally, while humans are good at driving in complex or
off-road environments, they are not good at articulating their preferences in
doing so.

Therefore, as preferences that mobile robotic systems are expected to exhibit
become more complex, the task of encoding these preferences in a cost function
will become both more difficult and time consuming, while at the same time
becoming more central to improving performance. Unfortunately, this challenge
has not received a great deal of focus; it is often only briefly mentioned in
the literature. With respect to costs defined over patches of terrain, the
mapping to cost from terrain parameters or features is rarely described in
detail, usually with the statement that it was simply constructed and
hand-tuned to provide good empirical performance.

The issue of cost function design and tuning was central during the
development of Crusher [Stentz et al., 2007, Bagnell et al., 2010] (Figure 1),
a vehicle designed from scratch for off-road autonomous mobility. Crusher is
capable of traversing steep slopes, deep washes and ditches, large boulders,
and dense vegetation. Robust navigation through such terrain requires a proper
description of preferences over such terrain (e.g. how far should Crusher
travel through dense vegetation to avoid a small ditch). Complicating matters
further is the complexity of the perceptual inputs that navigating such
environments necessitates; complex terrain requires a high dimensional
description in order to sufficiently encode all the relevant features (see
Figure 3). Finally, Crusher receives multiple sources of perceptual input:
along with the ever changing stream of perceptual data produced onboard the
vehicle, static prior data was available at varying resolution over large
areas of operation (potentially hundreds of square kilometers)
[Silver et al., 2006, Silver et al., 2008]. This necessitated multiple cost
functions for interpreting these disjoint data sources that encoded robust
behavior both individually and when fused together.

This paper explores the application of a learning from demonstration approach
to this challenge. Specifically, algorithms are presented for learning
generalizable terrain cost functions from expert demonstration of desired
behavior. This approach can both reduce development effort and improve
performance when applied to mobile robotic systems. The next section provides
a brief overview of previous and related work in coupling perception and
planning systems through cost functions. Section 3 presents the basic theory
of our approach, and Section 4 describes its practical adaptation to mobile
robotics. Experimental results from the Crusher system are provided in
Section 5, with discussion and conclusion in Section 6.

2  Related  Work 

2.1  Hand  Tuned  Cost  Functions 

Manual hand tuning and engineering has by far been the most common approach to
constructing cost functions for mobile robotic systems. Often, this is done
with little or no formalism; cost function design by hand remains one of the
‘black arts’ of mobile robotics, and has been applied to untold numbers of
robotic systems. However, there has previously been work that created reusable
frameworks for manual design and tuning of cost functions to map perceptual
features into scalar costs [Stentz and Hebert, 1995, Singh et al., 2000,
Balakirsky and Lacaze, 2000, Seraji and Howard, 2002, Huertas et al., 2005,
Biesiadecki and Maimone, 2006]. The common thread amongst these approaches is
that they construct a parameterized mapping from a perceptual input space to a
scalar cost, and then adjust the mapping to produce costs that result in good
robot performance. In this way, the behavior of an entire robotic system is
simply tuned to get a desired result.

Figure 1: The Crusher autonomous mobility platform is capable of cross-country
traverse through rough, complex, and unstructured terrain.

This approach has several disadvantages. It is potentially quite time
consuming, as complex environments necessitate full featured and high
dimensional descriptions, often on the order of dozens of features per
discrete location. Worse still, there is often not a clear relationship
between these features and cost. Therefore, engineering a cost function by
hand is akin to manually solving a high dimensional optimization problem using
local gradient methods. Evaluating each candidate function requires validation
through either actual or simulated robot performance. Such a manual process
requires a very detailed knowledge of both a robot’s perception and planning
systems; therefore the necessary effort must come from a potentially small
pool of full system experts.

Additionally, this tedious process is never truly completed, but rather
remains ongoing. Whenever incorrect behavior is observed, the cost function
may need to be revisited. This is especially problematic when operating in a
novel environment or scenario, as there is no guarantee that a manually tuned
cost function will generalize well. It is also possible that multiple cost
functions may be necessary to support different subsets of available
perceptual data. Finally, if the perception system itself is ever modified to
add, remove, or modify existing features (a very common occurrence during
development of a fielded system), the cost function must also be redesigned or
retuned.

However, perhaps the biggest issue with manually tuning a cost function is
that there is no formalism behind it. That is, there is no theory to explain
why a certain cost function produces good behavior, or how well it will
generalize to novel environments. Further, even if sufficient validation is
performed on a candidate cost function, there is nothing to indicate its
absolute performance; that is, it may be sufficient, but could it still be
better? As with many optimization procedures, manual parameter tuning is
likely to suffer diminishing returns, and will quickly reach a point where
human effort and patience are unlikely to improve a parameterization further.
Such a forced early stopping will always leave lingering questions, and can
make blame assignment difficult. That is, if the robot experiences a
navigation failure (e.g. drives over something it should not have, or avoids
something it did not need to), it is unclear whether the blame lies with
perception, planning, or their coupling (the cost function).

Figure 2: Examples illustrating the effect of different cost functions on a
robot’s behavior. Left: In simple environments with only well defined
obstacles and freespace, a wide range of cost functions will produce nearly
equivalent behavior. Center: With a single class of intermediate terrain
(bushes), several different paths are optimal depending on the relative cost
of bushes and freespace. Right: With multiple classes of intermediate terrain,
the variety of paths that could be optimal (depending on the relative cost of
bushes, grass, and freespace) is further increased. In such scenarios, the
tuning of a cost function will dramatically affect a robot’s behavior. If only
binary cost functions were utilized, the variety of reproducible behavior
would be diminished.


2.2  Physical  Simulation 

Another common approach to engineering this problem away is the use of
accurate physical simulation [Olin and Tseng, 1991, Kelly, 1995,
Iagnemma et al., 1999, Helmick et al., 2009] to attempt to predict the
consequences of a robot traversing a patch of terrain. Instead of requiring a
mapping from perceptual features to cost, this can transform the problem to
one of mapping from predicted vehicle state to cost. At times, this can result
in a simpler or more intuitive domain for constructing a cost function;
however, it could also make the problem more difficult (if the vehicle state
is higher dimensional than the perceptual feature space). Regardless, there is
still a requirement to construct a mapping from a description of a behavior to
a cost function that implies preferences.

Another possible use of physical simulation is to model the probability of
terrain being traversable. That is, a simulation could compute the probability
that interaction with a specified terrain patch would result in a vehicle
failure (e.g. exceed a tip-over angle or known force limits). The concept of
explicitly using analog (as opposed to binary) traversability as a cost
function (and therefore defining terrain preferences with respect to
traversability) is yet another way of engineering around a manually tuned cost
function. However, it is not without serious drawbacks. Fundamental amongst
them is that it limits the metric that the autonomous system can optimize to
pure maximization of vehicle safety. While in certain contexts this may be
desirable, the inability to ever indicate that slight risk should be taken to
balance a different tradeoff (distance traveled, time taken, energy
consumption, etc.) can seriously limit a system. For example, a mobile system
that navigated purely based on traversability would likely have difficulty
properly differentiating between perfectly flat and slightly angled terrain,
or between small obstacles of different size, or sparse vegetation of varying
height. The end result is that a traversability score rarely directly maps to
a cost that produces desired behavior; a cost function (that may heavily
depend on traversability estimates) is still required. Finally, performing
such a full, accurate simulation is quite difficult, especially using noisy
perceptual data as input.

2.3  Supervised  Classification 

A different approach that can also simplify the parameter tuning problem is
the use of supervised classification. The general technique is to reduce a
high dimensional perceptual feature space into a lower dimensional space with
more semantic meaning. Supervised classification is an obvious choice for such
a transformation: while an engineer may have difficulty in designing rules to
classify different patches of terrain, he can much more easily define the
class that each patch should belong to. Rather than manually constructing
rules to map from perceptual features to classifications, a learning system
can automatically generate the necessary rules, given examples of sets of
features and their desired classification.

Figure 3: Left: Raw perception data from Crusher, in the form of camera images
and colorized LiDAR. Right: 2D costs derived from perception data, and a
resulting planned path. Brighter pixels indicate higher cost.

The primary advantage (with respect to autonomous behavior) of performing a
supervised classification is a remapping of a perceptual feature space into
one that is lower dimensional and potentially more intuitive. Some perceptual
features have an intuitive, monotonic relationship to concepts such as safety
and speed. However, many features do not provide this intuition; for example,
if a terrain patch has a high blue tint, does that make it more or less
dangerous? If a supervised classification stage is inserted after perceptual
feature generation but before costing, it may simplify the parameter tuning
problem; not only may there be fewer parameters, but they may be more
intuitive, especially if classes are defined as semantic or material
distinctions (bush, rock, tree, grass, dirt, etc.). For this reason, this
general approach is popular for the purpose of perceptual interpretation in
unstructured environments, and has been widely applied [Talukder et al., 2002,
Manduchi et al., 2005, Lalonde et al., 2006, Bradley et al., 2007,
Dima et al., 2004, Rasmussen, 2002, Angelova et al., 2007,
Halatci et al., 2007, Karlsen and Witus, 2007].

However, while supervised multi-class classification may make costing more
tractable, it does not actually solve the core problem; the parameter tuning
task has only been simplified. While classifier outputs may have a more
intuitive relationship to the correct behavior, it can still be difficult and
time consuming to determine the proper relative weightings of various classes.
Further, there is no guarantee that the selected taxonomy is relevant to
mobility behavior: it may take a lot of effort for a classification system to
distinguish varying classes of vegetation, but that effort is wasted if those
vegetation classes are indistinguishable with regard to mobility. In such a
case, a large error in classification would produce only a small error in
vehicle behavior. The converse can also occur; tiny errors in classification
between certain classes may produce large errors in behavior (if, for example,
traversable vegetation is confused with a rigid obstacle). This makes it
difficult to relate classification performance to mobility performance.
Defining classes specifically with respect to relative mobility
[Happold and Ollis, 2007] is a partial solution, but makes data labeling far
less intuitive.

Additionally, this approach can potentially hurt overall performance.
Classification into a lower dimensional space can be viewed as a form of
dimensionality reduction. However, as classes may not be directly related to
robot mobility or behavior, this reduction does not take into account the
final use of the data, and can potentially obscure or eliminate useful
information. Finally, while supervised classification can reduce the time and
effort required in one tuning problem, the total effort throughout the system
is not necessarily reduced; the classification system now must be tuned.
Labeling a large representative data set for training and validation also
entails a significant manual effort. Whenever the classifier changes, due to
perception system change or additions to the training set, the costing of the
classifications must be re-examined and possibly re-tuned.

An alternate use of supervised classification has been to treat the task as a
specific two class problem: classifying terrain as either traversable or
non-traversable. Labels can be gathered either offline
[Seraji and Howard, 2002, Howard et al., 2007] or online [Thrun et al., 2006,
Sun et al., 2007] by observing where an expert drives a robot; nearby terrain
that is not traversed is treated as having been labeled non-traversable in a
noisy manner¹. [Ollis et al., 2007] involves similar data collection, but only
for labeling of traversable terrain; explicit examples of non-traversable
regions are not required. The issues with this approach are twofold. First,
the labeling is likely to be very noisy, especially if examples of
non-traversable terrain are obtained by using terrain located near traversable
examples. More importantly, just as with physical simulation, a probability of
traversability, even if accurate, rarely directly maps to the correct
behavior.

2.4  Self-Supervised  Learning  from  Experience

In contrast with learning approaches that require explicit supervision from a
(usually human) expert, self-supervised approaches require no expert
interaction. Instead, a robot uses its own interactions with the world to
slowly learn how to interpret what it perceives; the robot learns and adapts
from experience. As opposed to requiring outside supervision and instruction,
the robot can learn from its own mistakes. This allows robots equipped with
self-supervision to adapt to novel environments or scenarios with little to no
human intervention, and is a powerful tool for achieving both robustness and
reusability. Unsurprisingly, online self-supervised approaches to learning
have gained increasing popularity in recent years.

Approaches for self-supervised online learning can be divided into two
distinct classes. The first is near-to-far learning, which learns how to
interpret a far-range, lower-resolution sensor using a near-range,
higher-resolution sensor. Near-to-far learning is achieved by recalling how
specific patches of terrain were perceived by a far-range sensor, and then
later observing them with a near-range sensor. This provides a correspondence
between the output of the two sensing modalities, and provides the necessary
data to learn a mapping. Examples of such approaches include learning to
interpret monocular vision systems from shorter range LiDAR
[Dahlkamp et al., 2006] or stereo [Hadsell et al., 2009] range data, learning
to map near-range traversability estimates to far-range estimates
[Bajracharya et al., 2008, Procopio et al., 2007], and learning to map full
cost functions from near to far [Sofman et al., 2006] (Figure 4). However,
near-to-far learning still requires a (preexisting) correct interpretation of
near-range sensors. In fact, this near-range interpretation is increased in
importance, since it is being used as a ground truth signal for learning.

Figure 4: An example of near-to-far learning from [Sofman et al., 2006]. In
this example, as a robot drives through an environment (left to right) it uses
its short range sensors to train the interpretation of a far range sensor (in
this case the satellite image at left).

1 This approach is fundamentally different from standard imitation learning in
that demonstration is used purely as a data labeling technique; offline hand
labeling could be used with similar results.
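The near-to-far scheme described in this section can be sketched in a few lines. This is only an illustration under strong assumptions: a single scalar far-range feature per terrain patch, a near-range cost taken as ground truth, and an ordinary least-squares line as the learned mapping. The class name `NearToFarLearner`, the patch identifiers, and all numeric values are hypothetical, and the cited systems use far richer features and learners.

```python
class NearToFarLearner:
    """Pairs remembered far-range features with later near-range costs."""

    def __init__(self):
        self.far_memory = {}  # patch id -> far-range feature seen earlier
        self.pairs = []       # (far feature, near-range cost) correspondences
        self.slope = 0.0
        self.intercept = 0.0

    def observe_far(self, patch, feature):
        # Remember how a distant patch looked to the far-range sensor.
        self.far_memory[patch] = feature

    def observe_near(self, patch, cost):
        # When the same patch is later seen up close, record a training pair.
        if patch in self.far_memory:
            self.pairs.append((self.far_memory[patch], cost))

    def fit(self):
        # Ordinary least squares for a one-dimensional linear mapping.
        n = len(self.pairs)
        mean_x = sum(x for x, _ in self.pairs) / n
        mean_y = sum(y for _, y in self.pairs) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in self.pairs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in self.pairs)
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x

    def predict(self, feature):
        # Estimate cost for a patch seen only by the far-range sensor.
        return self.slope * feature + self.intercept


# Hypothetical data: far-range feature per patch, then near-range costs.
far_features = {"p0": 0.1, "p1": 0.2, "p2": 0.5, "p3": 0.8, "p4": 0.9}
near_costs = {"p0": 1.2, "p1": 1.4, "p2": 2.0, "p3": 2.6, "p4": 2.8,
              "p9": 3.0}  # p9 was never seen at far range, so it is ignored

learner = NearToFarLearner()
for patch, feature in far_features.items():
    learner.observe_far(patch, feature)
for patch, cost in near_costs.items():
    learner.observe_near(patch, cost)
learner.fit()
predicted = learner.predict(0.6)
print(learner.slope, learner.intercept, predicted)
```

Note that the learned mapping can only be as good as the near-range costs it is trained on, which is the dependence on a preexisting near-range interpretation pointed out above.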

The other distinct class of self-supervised learning is learning from
proprioception, also called learning from underfoot. As opposed to near-to-far
learning, the ground truth signal comes not from a higher resolution
exteroceptive sensor, but from proprioceptive sensors. Aside from this
distinction, the methodology is quite similar: as the robot drives over a
patch of terrain, it recalls how that terrain appeared in its near-range
sensors just moments ago. This sets up a correspondence between how terrain
appears, and how the robot interacts with it. These correspondences can be
used to learn to model the robot’s interactions with terrain by predicting
various terramechanical properties, such as roughness
[Stavens and Thrun, 2006], vehicle slip [Angelova et al., 2007], soil cohesion
[Iagnemma et al., 2004], or vegetation height [Wellington et al., 2006].
Predictions can then either be used directly to determine robot behavior
(i.e. as input into a cost function), or used to improve the accuracy of a
physical simulation [Helmick et al., 2009]. While these approaches certainly
provide useful information that can drastically improve a robot’s robustness
and adaptation, they do not directly address the general problem of mapping
from features of terrain to terrain preferences.

Another approach to learning from proprioception is to attempt to directly
learn traversability by observing what terrain a robot can and cannot
successfully traverse [Shneier et al., 2008, Kim et al., 2006]. However, this
approach has the same drawbacks as previously described techniques based on
pure traversability. That is, it ignores the possibility of relative
preferences amongst equally traversable patches of terrain. This issue also
applies to other single metric learning from proprioception; for example, a
robot that learns to estimate its own speed over different terrains would
never be able to differentiate between terrains on which maximum speed is
possible but different mobility risk is encountered. More importantly, this
online approach to labeling terrain as non-traversable requires explicit
interaction with non-traversable terrain (e.g. the robot must drive into an
obstacle to learn that it is an obstacle). Not only is this dangerous, it also
results in a potentially difficult blame assignment problem (determining which
patch or patches of terrain were responsible for a failure).

2.5  Learning  From  Expert  Demonstration

Given the difficulty in manually engineering a coupling between perception and
planning systems, an alternative solution is to avoid this problem altogether
by learning to directly map perception to actions. This can be accomplished
through learning from demonstration, also known as imitation learning. A key
principle of imitation learning is that while it may be very difficult to
quantify why a certain behavior is desirable, the actual correct behavior is
usually known by a human expert. Therefore, rather than having a human expert
tune a system to achieve desired behavior, the expert can demonstrate desired
behavior and the robot can tune itself to match the demonstration. Learning
from demonstration has a long history with mobile robotics, starting with the
ALVINN system [Pomerleau, 1989] and continuing through more recent
applications [LeCun et al., 2006, Howard et al., 2005, Hamner et al., 2006].


These approaches to learning from demonstration all fall into the category of
action prediction: given the state of the world, or features of the state of
the world, they attempt to predict the action that the human demonstrator
would have selected. The fundamental problem with pure action prediction is
that it is a purely reactive approach to planning. There is no attempt to
model or reason about the consequences of a chosen action. Therefore, all
relevant information must be encoded in the current state or features of the
state. For certain scenarios this is feasible; path trackers are a classic
example. However, in general the use of purely reactive systems is incredibly
difficult for planning long range, goal directed behavior through complex
unstructured environments.

An alternative to action prediction finds its roots in the idea of Inverse
Optimal Control [Kalman, 1964]. While optimal control seeks a trajectory
through a state space that optimizes some known metric, inverse optimal
control seeks a metric such that a known trajectory through a state space is
optimal under said metric. Within mobile robotics, the parallel would be to
learn a cost function such that a robot’s planning system will reproduce
expert² demonstrated behavior. Such an approach automates the hand
construction and tuning of cost functions that is prevalent in actual
deployments. Not only does automating this procedure result in a large
reduction in design effort, it can potentially produce a better coupling;
automatic optimization need not be nearly as concerned with diminishing
returns, and can continue until convergence. Further, an automated approach
can take advantage of standard cross validation techniques to ensure that a
learned coupling is robust and will generalize well. Finally, if collection of
expert demonstration includes raw sensor data, changes to sensor processing
within a perception system do not result in additional human effort with
respect to a cost function; existing demonstrations can simply be reprocessed
and the coupling relearned.

Inverse Reinforcement Learning [Ng and Russell, 2000] was the
first application of this idea to the framework of Markov Decision
Processes commonly used for motion planning in mobile robotics.
This framework was later modified into a new approach known as
Apprenticeship Learning [Abbeel and Ng, 2004]. Apprenticeship
learning resulted in an algorithm that produced linear cost
functions such that any planned behavior would have the same cost
as demonstrated behavior (between equivalent start and end
conditions); however there was no mechanism for explicitly
matching expert behavior. Additionally, the final solution was a
stochastic mixture of multiple policies. The Maximum Margin
Planning (MMP) [Ratliff et al., 2006] framework addressed these
problems by producing a single deterministic solution while also
ensuring an upper bound on the mismatch between demonstrated and
planned behavior. More recent work has extended the MMP framework
to non-linear cost functions [Ratliff et al., 2009,
Silver et al., 2008], and was applied to Crusher for the task of
learning a cost function to interpret both static and dynamic
perceptual data [Silver et al., 2009a, Silver et al., 2009b].

2 This expert need no longer be a full system expert, but
rather only needs to have the necessary intuition to understand
how the robot should drive.

3  Learning to Interpret Perceptual Data from Expert
Demonstration

The  MMP  framework  treats  learning  from  demon¬ 
stration  as  a  constrained  optimization  problem.  In 
decision  theory,  a  utility  function  is  defined  as  a  rela¬ 
tive  ordering  over  all  possible  options.  Since  most 
mobile  robotic  systems  encounter  an  infinite  num¬ 
ber  of  possible  plans,  such  an  explicit  ordering  is  not 
possible.  Instead,  plans  are  scored  by  a  cost  func¬ 
tion,  and  cost  defines  the  ordering.  However,  this 
core  idea  of  an  ordering  of  preferences  remains  use¬ 
ful.  If  an  expert  can  concretely  define  a  relative  or¬ 
dering  over  even  a  small  subset  of  possible  plans,  this 
ordering  becomes  a  constraint  on  the  cost  function: 
the  costs  assigned  to  each  possible  plan  must  match 
the  relative  ordering.  The  more  information  on  rela¬ 
tive  preferences  an  expert  provides,  the  more  the  cost 
function  is  constrained. 

This section first derives the basic MMP algorithm for learning a
linear cost function to reproduce expert demonstration. Next the
LEARCH algorithm (LEArning to seaRCH) [Ratliff et al., 2009,
Silver et al., 2008] is presented for extending this approach to
non-linear cost functions. The MMP framework and associated
algorithms are defined over general Markov Decision Processes.
However, without loss of generality, this derivation is restricted
to deterministic MDPs with a set of absorbing states; that is,
goal directed path or motion planners. This change is purely for
notational simplification, as most mobile robotic systems use
planners of this form. Additionally, only cost functions defined
over states are considered, as opposed to full state-action pairs
(see Section 6).


3.1  Maximum  Margin  Planning  with 
Linear  Cost  Functions 

Consider a state space $\mathcal{S}$ through which a planner
operates (e.g. $\mathcal{S} = \mathbb{R}^2$). A feature space
$\mathcal{F}$ is defined over $\mathcal{S}$. That is, for every
$x \in \mathcal{S}$, there exists a corresponding feature vector
$F_x \in \mathcal{F}$. $F_x$ can be considered as the raw output of
a perception system at state $x$ (footnote 3). For the output of a
perception system to be used by a planner, it must be mapped to a
scalar cost value; therefore, $C$ is defined as a cost function,
$C : \mathcal{F} \to \mathbb{R}^+$. The cost of a state $x$ is
$C(F_x)$. For the moment, only linear cost functions of the form
$C(F) = w^T F$ are considered; the weight vector $w$ completely
defines the cost function. Finally, a path $P$ is defined as a
sequence of states in $\mathcal{S}$ that lead from a start $s$ to a
goal $g$. The cost of an entire path is simply defined as the sum
of the costs of all states along the path, or alternatively the
cost of the cumulative feature counts

$$C(P) = \sum_{x \in P} C(F_x) = \sum_{x \in P} w^T F_x = w^T \sum_{x \in P} F_x \qquad (1)$$
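The equivalence in (1) can be checked numerically. The sketch below is illustrative only: the state names, feature values, and weights are hypothetical, and the helper names `path_cost` and `feature_counts` do not come from the paper.

```python
def path_cost(weights, features, path):
    """Sum the per-state linear costs w^T F_x along the path."""
    return sum(
        sum(w * f for w, f in zip(weights, features[x]))
        for x in path
    )


def feature_counts(features, path, dim):
    """Cumulative feature vector: the sum of F_x over states on the path."""
    totals = [0.0] * dim
    for x in path:
        for k, value in enumerate(features[x]):
            totals[k] += value
    return totals


# Hypothetical 3-state path with 2 features per state.
features = {"a": (1.0, 0.0), "b": (0.5, 2.0), "c": (0.0, 1.0)}
weights = (2.0, 3.0)
path = ["a", "b", "c"]

per_state_total = path_cost(weights, features, path)
counts = feature_counts(features, path, len(weights))
via_counts_total = sum(w * c for w, c in zip(weights, counts))
```

Both totals agree, which is why a linear cost function only needs the cumulative feature counts of a path rather than the individual states.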

Now consider an example path $P_e$ from a start state $s_e$ to a
goal state $g_e$. If this example path is provided via expert
demonstration, then it is reasonable to consider applying inverse
optimal control; that is, seeking to find a cost function $C$ such
that $P_e$ is the optimal path from $s_e$ to $g_e$. While a single
example demonstration does not imply a single cost function, it
does constrain the space of cost functions $\mathcal{C}$: only cost
functions that consider $P_e$ the optimal path are acceptable. For
now, it is assumed that all $P_e$ are at least near optimal under
at least one $C \in \mathcal{C}$; Section 4.2 will relax this
assumption.

If a regularization term is also added to encourage simple
solutions, then the task of finding an acceptable cost function
from an example can be phrased as the following constrained
optimization problem:

$$\text{minimize } \mathcal{O}(w) = \|w\|^2 \qquad (2)$$

subject to the constraints

$$\sum_{x \in P} w^T F_x \ge \sum_{x \in P_e} w^T F_x \quad \forall P \text{ s.t. } P \ne P_e,\ s = s_e,\ g = g_e$$

Unfortunately, this optimization has a trivial solution: $w = 0$.
This issue can be addressed by including a margin in each
constraint. The size of the margin is dependent on the similarity
between paths; the example only needs to be slightly lower cost
than a very similar path, but should be much lower cost than a very
different path. Similarity between $P_e$ and an arbitrary path $P$
is encoded by a loss function $L(P_e, P)$, or alternatively
$L_e(P)$ (footnote 4). The definition of the loss function is
somewhat application dependent; the simplest form would be to
count the states of $P$ that are not shared with $P_e$ (a Hamming
loss). The effect of the scale of the margin is removed by the
regularization term. The constrained optimization can now be
rewritten as

$$\text{minimize } \mathcal{O}(w) = \|w\|^2 \qquad (3)$$

subject to the constraints

$$\sum_{x \in P} \left( w^T F_x - L_e(x) \right) \ge \sum_{x \in P_e} w^T F_x \quad \forall P \text{ s.t. } P \ne P_e,\ s = s_e,\ g = g_e$$

$$L_e(x) = \begin{cases} 1 & \text{if } x \notin P_e \\ 0 & \text{otherwise} \end{cases}$$

3 For now it is assumed that the assignment of a feature
vector to a state is static; the extension to dynamic assignment
is addressed in Section 4.1.


Depending  on  the  state  space,  and  the  distance 
from  se  to  ge ,  there  are  likely  to  be  an  infeasible  (and 
possibly  infinite)  number  of  constraints;  one  for  each 
alternate  path  to  the  demonstrated  example.  How¬ 
ever,  it  is  not  necessary  to  enforce  every  constraint. 
For  any  candidate  cost  function,  there  is  a  minimum 
cost  path  between  any  two  waypoints,  P*.  It  is  only 
necessary  to  enforce  the  constraint  for  P*,  as  once  it 
is  satisfied  by  definition  all  other  constraints  will  be 
satisfied.  With  this  single  constraint,  (3)  becomes 


$$\text{minimize } \mathcal{O}(w) = \|w\|^2 \qquad (4)$$

subject to the constraint

$$\sum_{x \in P_*} \left( w^T F_x - L_e(x) \right) \ge \sum_{x \in P_e} w^T F_x$$

$$P_* = \arg\min_P \sum_{x \in P} \left( w^T F_x - L_e(x) \right)$$

It may not always be possible to exactly meet this constraint (the
margin may make it impossible). Therefore, a slack term $\zeta$ is
added to allow for this possibility.

$$\text{minimize } \mathcal{O}(w) = \lambda \|w\|^2 + \zeta \qquad (5)$$

subject to the constraint

$$\sum_{x \in P_*} \left( w^T F_x - L_e(x) \right) - \sum_{x \in P_e} w^T F_x + \zeta \ge 0$$

4 As a path is considered just a sequence of states, the loss
function can be defined either over a full path or over a single
state.

The slack term $\zeta$ accounts for the error in meeting the
constraint, while $\lambda$ balances the tradeoff in the objective
between regularization and meeting the constraint. However, the
slack variable will always be tight; that is, $\zeta$ will always be
exactly equal to the difference in path costs. Therefore, $\zeta$
can be replaced in the objective by the constraint, resulting in
the following (unconstrained) optimization problem

$$\text{minimize } \mathcal{O}(w) = \lambda \|w\|^2 + \sum_{x \in P_e} w^T F_x - \sum_{x \in P_*} \left( w^T F_x - L_e(x) \right) \qquad (6)$$

or alternatively

$$\text{minimize } \mathcal{O}(w) = \lambda \|w\|^2 + \sum_{x \in P_e} w^T F_x - \min_P \left[ \sum_{x \in P} \left( w^T F_x - L_e(x) \right) \right] \qquad (7)$$

The final optimization seeks to minimize the difference in cost
between the example path $P_e$ and the (loss augmented) optimal
path $P_*$, subject to regularization. $\mathcal{O}(w)$ is convex,
but non-differentiable; therefore, instead of gradient descent, it
can be minimized using the sub-gradient method with learning rate
$\eta$. The sub-gradient of $\mathcal{O}$ with respect to $w$ is

$$\nabla \mathcal{O} = 2 \lambda w + \sum_{x \in P_e} F_x - \sum_{x \in P_*} F_x \qquad (8)$$

Intuitively,  (8)  says  that  the  direction  that  will 
most  minimize  the  objective  function  is  found  by 
comparing  feature  counts.  If  more  of  a  certain  fea¬ 
ture  is  encountered  on  the  example  path  than  the 
current  minimum  cost  path  P*,  the  weight  on  that 
feature  (and  therefore  the  cost)  should  be  decreased. 
Likewise,  if  less  of  a  feature  is  encountered  on  the 
example  path  than  on  P*,  the  weight  should  be  in¬ 
creased.  Although  the  margin  does  not  appear  in  the 
final  sub-gradient,  it  does  affect  the  computation  of 
P* .  The  final  linear  MMP  algorithm  consists  of  iter¬ 
atively  computing  feature  counts  and  then  updating 
the cost function until convergence. However, one final caveat is
to ensure that only cost functions that map to $\mathbb{R}^+$ are
considered (a requirement of most motion and path planners). This
is achieved by identifying $F$ such that $w^T F < 0$, and
projecting $w$ back into the space of allowable cost functions.

Algorithm 1: The linear MMP algorithm

Inputs: Example Paths $P_e^1, \ldots, P_e^n$, Feature Map $\mathcal{F}$
$w_0 = 0$;
for $j = 1 \ldots K$ do
    $M$ = buildCostmap($w_{j-1}$, $\mathcal{F}$);
    $F_e = F_* = 0$;
    foreach $P_e^i$ do
        $P_*^i$ = planLossAugPath(start($P_e^i$), goal($P_e^i$), $M$);
        foreach $x \in P_e^i$ do
            $F_e = F_e + F_x$;
        foreach $x \in P_*^i$ do
            $F_* = F_* + F_x$;
    $w_j = w_{j-1} + \eta_j [F_* - F_e - \lambda w_{j-1}]$;
    enforcePositivityConstraint($w_j$, $\mathcal{F}$);
return $w_K$

The  MMP  framework  easily  supports  the  use  of 
multiple  example  paths.  Each  example  implies  its 
own  constraints  as  in  (4),  its  own  objective  as  in  (7), 
and  its  own  sub-gradient  as  in  (8).  Updating  the  cost 
weights  can  take  place  either  on  a  per  example  basis, 
or  the  feature  counts  can  be  computed  in  a  batch 
with  a  single  update.  The  latter  is  computationally 
preferable,  as  it  may  result  in  fewer  cost  function  eval¬ 
uations,  and  projections  back  into  the  space  of  allow¬ 
able  cost  functions.  The  final  linear  MMP  algorithm 
is  presented  in  Algorithm  1. 
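A single batch update of the linear algorithm can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the loss-augmented plan is assumed to be supplied by an external planner, the feature values are hypothetical, and the positivity constraint is approximated by clipping weights at zero (which keeps $w^T F \ge 0$ only under the added assumption that all features are non-negative).

```python
def feature_counts(features, path, dim):
    """Sum the feature vectors F_x over every state x on the path."""
    totals = [0.0] * dim
    for x in path:
        for k, value in enumerate(features[x]):
            totals[k] += value
    return totals


def mmp_step(w, features, example_path, planned_path, lam, eta):
    """One sub-gradient step: w <- w + eta * (F_* - F_e - lam * w).

    planned_path is assumed to be the loss-augmented plan under the
    current weights; computing it is left to the caller's planner.
    """
    f_e = feature_counts(features, example_path, len(w))
    f_star = feature_counts(features, planned_path, len(w))
    stepped = [
        wk + eta * (fs - fe - lam * wk)
        for wk, fe, fs in zip(w, f_e, f_star)
    ]
    # Simplified stand-in for the positivity projection (assumption):
    # with non-negative features, clipping at zero keeps w^T F >= 0.
    return [max(0.0, wk) for wk in stepped]


# Hypothetical data: feature 0 = "grass", feature 1 = "rock".
features = {
    "s": (0.0, 0.0), "g1": (1.0, 0.0), "g2": (1.0, 0.0),
    "r1": (0.0, 1.0), "goal": (0.0, 0.0),
}
example = ["s", "g1", "g2", "goal"]   # expert drives over grass
planned = ["s", "r1", "goal"]         # current plan cuts across rock

w_old = [1.0, 1.0]
w_new = mmp_step(w_old, features, example, planned, lam=0.1, eta=0.5)
```

In this toy case the weight on the feature seen more often on the example path (grass) is lowered, and the weight on the feature seen more often on the planned path (rock) is raised, matching the feature-count intuition given for (8).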

3.2  MMP  with  Non-Linear  Cost 
Functions 

The derivation to this point has assumed that the space of possible
cost functions $\mathcal{C}$ consists of all functions of the form
$C(F) = w^T F$. Extension to other, more descriptive spaces of cost
functions is possible by considering (7) for any cost function $C$

$$\text{minimize } \mathcal{O}[C] = \lambda\,\mathrm{Reg}(C) + \sum_{x \in P_e} C(F_x) - \min_P \left[ \sum_{x \in P} \left( C(F_x) - L_e(x) \right) \right] \qquad (9)$$

$\mathcal{O}[C]$ is now an objective functional over a cost
function, and Reg represents a regularization functional. We can
now consider the sub-gradient in the space of cost functions

$$\nabla \mathcal{O}_F[C] = \lambda \nabla \mathrm{Reg}_F[C] + \sum_{x \in P_e} \delta_F(F_x) - \sum_{x \in P_*} \delta_F(F_x) \qquad (10)$$

$$P_* = \arg\min_P \sum_{x \in P} \left( C(F_x) - L_e(x) \right)$$

where $\delta$ is the Dirac delta at the point of evaluation.
Simply speaking, the functional gradient is positive at values of
$F$ corresponding to states in the example path, and negative at
values of $F$ corresponding to states in the current planned path.
If the paths both contain a state corresponding to $F$, their
contributions cancel.

Applying  gradient  descent  directly  in  this  space 
would  result  in  an  extreme  form  of  overfitting;  es¬ 
sentially,  it  would  involve  raising  or  lowering  the  cost 
associated  with  specific  values  of  F  encountered  on 
either  path,  and  would  therefore  produce  no  general¬ 
ization  whatsoever.  Instead,  a  different  space  of  cost 
functions  is  considered 

$$\mathcal{C} = \left\{ C \mid C = \sum_i \eta_i R_i(F),\ R_i \in \mathcal{R},\ \eta_i \in \mathbb{R} \right\} \qquad (11)$$

$$\mathcal{R} = \left\{ R \mid R : \mathcal{F} \to \mathbb{R} \ \wedge\ \mathrm{Reg}(R) < \nu \right\}$$

$\mathcal{C}$ is now defined as the space of weighted sums of
functions $R_i \in \mathcal{R}$, where $\mathcal{R}$ is a space of
functions of limited complexity that map from the feature space to
a scalar. Choices of $\mathcal{R}$ include linear functions,
parametric functions, neural networks, decision trees, etc. As in
gradient boosting [Mason et al., 2000], this space represents a
limited set of ‘directions’ for which a small step can be taken;
the choice of the direction set in turn controls the complexity of
$\mathcal{C}$.

With this new definition, a gradient descent update takes the form
of projecting the functional gradient (footnote 5) onto the
direction set by finding the element $R_* \in \mathcal{R}$ that
maximizes the inner product
$\langle -\nabla \mathcal{O}_F[C], R_* \rangle$. The maximization
of the inner product between the functional gradient and the
hypothesis space can be understood as a learning problem:

$$\begin{aligned}
R_* &= \arg\max_R \langle -\nabla \mathcal{O}_F[C], R \rangle \\
    &= \arg\max_R \sum_{x \in P_e \cup P_*} -\nabla \mathcal{O}_{F_x}[C]\, R(F_x) \\
    &= \arg\max_R \sum_{x \in P_e \cup P_*} \alpha_x y_x R(F_x)
\end{aligned} \qquad (12)$$

$$\alpha_x = \left| \nabla \mathcal{O}_{F_x}[C] \right| \qquad y_x = -\mathrm{sgn}\left( \nabla \mathcal{O}_{F_x}[C] \right)$$

In this form, it can be seen that finding the projection of the
functional gradient involves solving a weighted classification
problem; the element of $\mathcal{R}$ that best discriminates
between feature vectors for which the cost should be raised or
lowered maximizes the inner product. Alternatively, defining
$\mathcal{R}$ as a class of regressors adds an additional
regularization to each individual $R_*$ [Ratliff et al., 2009].
Intuitively, the regression targets $y_x$ are positive in regions
of the feature space that the planned path visits more than the
example path (indicating a desire to raise the cost), and negative
in regions that the example path visits more than the planned path.
Each regression target is weighted by a factor $\alpha_x$ based on
the magnitude of the functional gradient.

5 For the moment, the regularization term is ignored.

In comparison to the linear MMP formulation, this approach can be
understood as trying to minimize the error in visitation counts
instead of feature counts. For a given feature vector $F$ and path
$P$, the visitation count $U$ is the cumulative count of the number
of states $x \in P$ such that $F_x = F$. The visitation counts can
be split into positive and negative components, corresponding to
the current planned and example paths. Formally

$$\begin{aligned}
U_+(F) &= \sum_{x \in P_*} \delta_F(F_x) \\
U_-(F) &= \sum_{x \in P_e} \delta_F(F_x) \\
U(F) &= U_+(F) - U_-(F) = \sum_{x \in P_*} \delta_F(F_x) - \sum_{x \in P_e} \delta_F(F_x)
\end{aligned} \qquad (13)$$

Comparing  this  formulation  to  (10)  demonstrates 
that  the  planned  visitation  counts  minus  the  example 
visitation  counts  equals  the  negative  functional  gra¬ 
dient  (ignoring  regularization).  This  allows  for  the 
computation  of  regression  targets  and  weights  purely 
as  a  function  of  the  visitation  counts,  providing  a 
straight  forward  implementation  (Algorithm  2)  mak¬ 
ing  use  of  off  the  shelf  regression  approaches  (repre¬ 
sented  by  Rj). 
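The conversion from visitation counts to a weighted regression data set can be sketched as below. This is an illustrative sketch under assumptions: feature vectors are represented as hashable tuples so that identical vectors can be counted, the planned paths are taken as given, and the function name `regression_data` and the toy data are hypothetical.

```python
from collections import Counter


def regression_data(features, example_paths, planned_paths):
    """Build (inputs, targets, weights) from visitation counts.

    U_+ counts visits by planned paths, U_- counts visits by example
    paths, and U = U_+ - U_-. The target is sgn(U), the weight |U|.
    """
    u_plus = Counter()
    u_minus = Counter()
    for path in planned_paths:
        u_plus.update(features[x] for x in path)
    for path in example_paths:
        u_minus.update(features[x] for x in path)

    inputs, targets, weights = [], [], []
    for f in sorted(set(u_plus) | set(u_minus)):
        u = u_plus[f] - u_minus[f]
        if u == 0:
            continue  # visits cancel, so this feature vector is skipped
        inputs.append(f)
        targets.append(1 if u > 0 else -1)
        weights.append(abs(u))
    return inputs, targets, weights


# Hypothetical one-dimensional features attached to five states.
features = {0: (0.0,), 1: (1.0,), 2: (1.0,), 3: (2.0,), 4: (0.0,)}
example = [0, 1, 2, 4]
planned = [0, 3, 4]

inputs, targets, weights = regression_data(features, [example], [planned])
```

The shared start and goal features cancel, the feature seen only on the example path receives a negative target (lower the cost), and the feature seen only on the planned path receives a positive target (raise the cost). Any off-the-shelf weighted regressor could then be fit to these three lists.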

Algorithm 2: The LEARCH algorithm

Inputs: Example Paths $P_e^1, P_e^2, \ldots, P_e^n$, Feature Map $\mathcal{F}$
$C_0 = 1$;
for $j = 1 \ldots K$ do
    $M$ = buildCostmap($C_{j-1}$, $\mathcal{F}$);
    $U_+ = U_- = 0$;
    foreach $P_e^i$ do
        $P_*^i$ = planLossAugPath(start($P_e^i$), goal($P_e^i$), $M$);
        foreach $x \in P_e^i$ do
            $U_-(F_x) = U_-(F_x) + 1$;
        foreach $x \in P_*^i$ do
            $U_+(F_x) = U_+(F_x) + 1$;
    $T_f = T_o = T_w = \emptyset$;
    $U = U_+ - U_-$;
    foreach $F_x$ such that $U(F_x) \ne 0$ do
        $T_f = T_f \cup F_x$;
        $T_o = T_o \cup \mathrm{sgn}(U(F_x))$;
        $T_w = T_w \cup |U(F_x)|$;
    $R_j$ = trainWeightedRegressor($T_f$, $T_o$, $T_w$);
    $C_j = C_{j-1} * e^{\eta_j R_j}$;
return $C_K$

A final addition to this algorithm involves a slightly different
approach to optimization. Gradient descent can be understood as
encouraging functions that are ‘small’ in the $\ell_2$ norm; by
controlling the learning rate $\eta$ and the number of epochs, it
is possible to constrain the complexity of the learned cost
function. However, instead we can consider exponentiated
functional gradient descent, which is a generalization of
exponentiated gradient to functional gradient descent
[Ratliff et al., 2009]. Exponentiated functional gradient descent
encourages functions that are ‘sparse’ in the sense of having many
small values and a few potentially large values. This change
results in $\mathcal{C}$ being redefined as

$$\mathcal{C} = \left\{ C \mid C = e^{\sum_i \eta_i R_i(F)},\ R_i \in \mathcal{R},\ \eta_i \in \mathbb{R} \right\} \qquad (14)$$

Another beneficial effect of this redefined space is that $C$
naturally maps to $\mathbb{R}^+$ without any need for projecting
the result of each gradient descent update into the space of valid
cost functions. This final algorithm for non-linear inverse optimal
control is called LEARCH [Ratliff et al., 2009,
Silver et al., 2008] and is presented in Algorithm 2. An example of
the algorithm in action is presented in Figure 5.
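The multiplicative update $C_j = C_{j-1} \, e^{\eta_j R_j}$ and its positivity property can be sketched in a few lines. The regressors below are hypothetical stand-ins (simple threshold functions), not learned models, and the helper name `exponentiated_update` is an assumption for illustration.

```python
import math


def exponentiated_update(cost, regressor, eta):
    """Return the new cost function C_j(F) = C_{j-1}(F) * exp(eta * R_j(F))."""
    def new_cost(f):
        return cost(f) * math.exp(eta * regressor(f))
    return new_cost


def c0(f):
    """Initial cost function C_0 = 1 everywhere."""
    return 1.0


def r1(f):
    """Hypothetical regressor: raise cost where feature 0 is large."""
    return 1.0 if f[0] > 0.5 else -1.0


def r2(f):
    """Hypothetical regressor: raise cost everywhere by a fixed amount."""
    return 2.0


c1 = exponentiated_update(c0, r1, eta=0.5)
c2 = exponentiated_update(c1, r2, eta=0.25)

high = c2((1.0,))   # feature vector the regressors push upward
low = c2((0.0,))    # feature vector r1 pushes downward
```

Because each step multiplies by a strictly positive factor, every intermediate cost function stays positive, and after two steps the cost equals $e^{\eta_1 R_1(F) + \eta_2 R_2(F)}$, the form given in (14).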

It should be noted that while seemingly similar, LEARCH is
fundamentally different from supervised classification approaches
presented in [Thrun et al., 2006, Sun et al., 2007]. While examples
of ‘good’ terrain are collected in a similar manner, these
approaches simply assume that terrain near where the vehicle was
demonstrated driving should be labeled as ‘bad’; the assumption is
that a classifier will be able to deal with the noisy labeling. In
contrast, LEARCH determines a set of states for which the total
cost must be increased; otherwise the demonstration would have
traversed through those states. Essentially, negative examples of
where to drive are implied by where the demonstrator explicitly
chose not to drive, rather than simply nearby regions that the
demonstrator could have driven. Additionally, the terrain that the
demonstration traversed is not explicitly considered as ‘good’
terrain; rather its costs are only lowered until the path is
preferred (for specified waypoints); there could still be high cost
regions along it. This distinction allows LEARCH to generalize well
over areas for which it was not explicitly trained (Figure 6).

Figure 5: An example of the LEARCH algorithm learning to interpret
satellite imagery (Top) as costs (Bottom). Brighter pixels indicate
higher cost. As the cost function evolves (left to right), the
current plan (green) recreates more and more of the example plan
(red). Quickbird imagery courtesy of Digital Globe, Inc. Images
cover approximately 300 m × 250 m.

Figure 6: Generalization of the LEARCH algorithm. The cost function
learned from the single example in Figure 5 generalizes over
terrain never seen during training (shown at approximately 1/2
scale) resulting in similar planner behavior. 3 sets of waypoints
(Left) are shown along with the corresponding paths (Center)
planned under the learned cost function (Right).

4  Adaptation  and  Application 
to  Mobile  Robotic  Systems 

4.1  Extension to Dynamic and Unknown Environments

The previous derivation of MMP and LEARCH only considered the
scenario where the mapping from states to features is static and
fully known a priori. Recent work [Silver et al., 2009b] has
extended the LEARCH algorithm to the scenario where neither of
these assumptions holds, such as when features are generated from a
mobile robot’s perception system. The limited range inherent in
onboard sensing implies a great deal of the environment may be
unknown; for truly complex navigation tasks, the distance between
waypoints is generally at least one or two orders of magnitude
larger than the sensor range. Further, changing range and point of
view from environmental structures means that even once an object
is within range, its perceptual features are continually changing.
Finally, there are the actual dynamics of the environment: objects
may move, lighting and weather conditions can change, etc.

Since  onboard  perceptual  inputs  are  not  static,  a 
robot’s  current  plan  must  be  continually  recomputed. 
The  original  MMP  constraint  must  be  altered  in  the 
same  way:  rather  than  enforcing  the  optimality  of 
the  entire  example  behavior  once,  the  optimality  of  all 
example  behavior  must  be  continually  enforced  as  the 
current plan is recomputed. Formally, we add a time index $t$ to
account for dynamics. $F_x^t$ represents the perceptual features of
state $x$ at time $t$. $P_e^t$ represents the example behavior
starting from the current state at time $t$ to the goal, with
associated loss function $L_e^t$. The objective becomes

$$\text{minimize } \mathcal{O}[C] = \lambda\,\mathrm{Reg}(C) + \sum_t \left( \sum_{x \in P_e^t} C(F_x^t) - \min_{P^t} \left[ \sum_{x \in P^t} \left( C(F_x^t) - L_e^t(x) \right) \right] \right) \qquad (15)$$

the new functional gradient is

$$\nabla \mathcal{O}_F[C] = \sum_t \left( \sum_{x \in P_e^t} \delta_F(F_x^t) - \sum_{x \in P_*^t} \delta_F(F_x^t) \right) \qquad (16)$$

$$P_*^t = \arg\min_{P^t} \left[ \sum_{x \in P^t} \left( C(F_x^t) - L_e^t(x) \right) \right]$$


The  cost  function  C  does  not  have  a  time  index:  the 
optimization  is  searching  for  the  single  cost  function 
that  best  reproduces  example  behavior  over  an  entire 
time  sequence. 

It is important to clarify exactly what $P_e^t$ represents. Until
now, the terms plan and behavior have been interchangeable. This is
true in the static case since the environment never evolves; as
long as a plan is sufficiently followed, it does not need to be
recomputed. However, in the dynamic case, an expert’s plan and
behavior are different notions: the plan is the currently intended
future behavior, and the behavior is the result of previous plans.
Therefore, $P_e^t$ would ideally be the expert’s plan at time $t$,
not example behavior from time $t$ onwards.

However, this information is generally not available: it would
require the recording of an expert’s instantaneous plan at each
point in time. Even if a framework for such a data collection were
to be implemented, it would turn the collection of training data
into an extremely tedious and expensive process. Therefore, in
practice we approximate the current plan of an expert $P_e^t$ with
the expert’s behavior from $t$ onwards. Unfortunately, this
approximation can potentially create situations where the example
at certain timesteps is suboptimal or inconsistent. The
consequences of this inconsistency and possible solutions are
discussed in Section 4.2 (see Figure 9).

Once dynamics have been accounted for, the limited range of onboard
sensing can be addressed. At time $t$, there may be no perceptual
features available corresponding to the (potentially large) section
of the example path that is outside of current perceptual range. In
order to perform long range navigation, a mobile robotic system
must already have some approach to planning through terrain it has
not directly sensed. Solutions include the use of prior knowledge
[Silver et al., 2008], extrapolation from recent experience
[Urmson et al., 2006], or simply to assume uniform properties of
unknown terrain.

Therefore, we define the set of visible states at time $t$ as
$V^t$. The exact definition of visible depends on the specifics of
the underlying robotic system’s data fusion: $V^t$ should include
all states for which the cost of state $x$ at time $t$ is computed
with the cost function currently being learned, $C$ (footnote 6).
For all other states $\bar{V}^t$, we can assume the existence of
some alternate function for computing cost, $C_{\bar{V}}(x)$; again
this could be as simple as a constant.

Since we have explicitly defined $V^t$ as the set of states at time
$t$ for which $C$ is the correct mechanism for computing cost, the
cost of a general path $P^t$ is now computed as

$$\sum_{x \in P^t \cap V^t} C(F_x^t) + \sum_{x \in P^t \cap \bar{V}^t} C_{\bar{V}}(x)$$
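The split path cost can be sketched as follows. This is only an illustration: the learned cost function, the fallback cost for unobserved states, and the toy states are hypothetical placeholders, and `split_path_cost` is not a name from the paper.

```python
def split_path_cost(path, visible, features, learned_cost, fallback_cost):
    """Cost of a path when only some states have perceptual features.

    States in `visible` are scored by the learned cost of their
    features; all other states are scored by the fallback function,
    which depends on the state only and not on any features.
    """
    total = 0.0
    for x in path:
        if x in visible:
            total += learned_cost(features[x])
        else:
            total += fallback_cost(x)
    return total


# Hypothetical scenario: states 0 and 1 are within sensor range.
path = [0, 1, 2, 3]
visible = {0, 1}
features = {0: (1.0,), 1: (2.0,)}


def learned_cost(f):
    """Stand-in for the learned C: a fixed linear function of feature 0."""
    return 3.0 * f[0]


def fallback_cost(x):
    """Stand-in for the cost of unobserved terrain: a constant."""
    return 5.0


total = split_path_cost(path, visible, features, learned_cost, fallback_cost)
```

Only the first two terms depend on the learned cost function, so only they can contribute to a gradient taken with respect to it.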

It is important to note that $C_{\bar{V}}(x)$ is not dependent on
$F_x^t$. With this new formulation for the cost of a path, the
objective functional becomes

$$\begin{aligned}
\text{minimize } \mathcal{O}[C] = {} & \lambda\,\mathrm{Reg}(C) \\
 & + \sum_t \left( \sum_{x \in P_e^t \cap V^t} C(F_x^t) + \sum_{x \in P_e^t \cap \bar{V}^t} C_{\bar{V}}(x) \right) \\
 & - \sum_t \min_{P^t} \left( \sum_{x \in P^t \cap V^t} \left( C(F_x^t) - L_e^t(x) \right) + \sum_{x \in P^t \cap \bar{V}^t} C_{\bar{V}}(x) \right)
\end{aligned} \qquad (17)$$

with functional gradient

$$\nabla \mathcal{O}_F[C] = \sum_t \left( \sum_{x \in P_e^t \cap V^t} \delta_F(F_x^t) - \sum_{x \in P_*^t \cap V^t} \delta_F(F_x^t) \right) \qquad (18)$$

$$P_*^t = \arg\min_{P^t} \left[ \sum_{x \in P^t \cap V^t} \left( C(F_x^t) - L_e^t(x) \right) + \sum_{x \in P^t \cap \bar{V}^t} C_{\bar{V}}(x) \right]$$

Since the gradient is computed with respect to $C$, it is only
nonzero for visible states; the $C_{\bar{V}}(x)$ terms disappear.
However, $C_{\bar{V}}$ still factors into the planned behavior, and
therefore does affect the learned component of the cost function.
Just as LEARCH learns $C$ to recreate desired behavior when using a
specific planner, it learns $C$ to recreate behavior when using a
specific $C_{\bar{V}}$. However, if the example behavior is
inconsistent with $C_{\bar{V}}$, it will be more difficult for the
planned behavior to match the example. Such an inconsistency could
occur if the expert has different prior knowledge than the robot,
or interprets the same knowledge differently. Inconsistency can
also occur due to the previously discussed mismatch between expert
plans and expert behavior. A solution to this problem is discussed
in Section 4.2.

6 In the case of Crusher, $V^t$ includes all locations that are
within current sensor range, and have been observed by said
sensors.

The projection of the functional gradient onto the hypothesis space
becomes

$$R_* = \arg\max_R \sum_t \sum_{x \in (P_e^t \cup P_*^t) \cap V^t} \alpha_x^t y_x^t R(F_x^t) \qquad (19)$$

Contrasting the final form for $R_*$ with that of (12) helps to
summarize the changes that result in the LEARCH algorithm for
dynamic environments. Specifically, a single expert demonstration
from start to goal is discretized by time, with each timestep
serving as an example of what behavior to plan given all data to
that point in time. For each of these discretized examples, only
visitation counts in visible states are used. The resulting
algorithm is presented as Algorithm 3.
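The time discretization described above can be sketched as follows. The representation of a demonstration as a list of (time, state) pairs, the integer timestamps, and the helper names are assumptions made for illustration; they stand in for the path segment extraction and perception simulation used by the dynamic algorithm.

```python
def discretize_demonstration(behavior, step):
    """Split one demonstration into per-timestep examples.

    behavior: list of (time, state) pairs sorted by time.
    Returns a list of (t, remaining_path) pairs, where remaining_path
    is the demonstrated behavior from time t to the goal.
    """
    t_first, t_last = behavior[0][0], behavior[-1][0]
    examples = []
    t = t_first
    while t <= t_last:
        segment = [state for (ts, state) in behavior if ts >= t]
        examples.append((t, segment))
        t += step
    return examples


def visible_part(path, visible):
    """Keep only the states of a path that lie in the visible set."""
    return [x for x in path if x in visible]


# Hypothetical demonstration with integer timestamps.
behavior = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
examples = discretize_demonstration(behavior, step=2)

# Hypothetical visible set at t = 2: only state "c" has been observed.
counted = visible_part(examples[1][1], {"c"})
```

Each entry of `examples` plays the role of one training example, and only the states returned by `visible_part` would contribute visitation counts for that timestep.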

A  final  detail  for  a  LEARCH  implementation  is  the 
source  of  the  input  perceptual  features.  Rather  than 
computing  and  logging  these  features  online,  it  is  use¬ 
ful  to  record  all  raw  sensor  data,  and  then  to  compute 
the  features  by  simulating  perception  offline.  This 
allows  existing  expert  demonstration  to  be  reused  if 
new  feature  extractors  are  added,  or  existing  ones 
modified;  perception  is  simply  re-simulated  to  pro¬ 
duce  the  new  inputs.  In  this  way,  learning  a  cost 
function  when  the  perception  system  is  modified  re¬ 
quires  no  additional  expert  interaction. 

Algorithm 3: The dynamic LEARCH algorithm

Inputs: Example Behaviors $P_e^1, P_e^2, \ldots, P_e^n$, Sensor Histories $\mathcal{H}^1, \mathcal{H}^2, \ldots, \mathcal{H}^n$, Cost Map $C_{\bar{V}}$
$C_0 = 1$;
foreach $P_e^i$ do
    for $t$ = firstTime($P_e^i$) : $\Delta t$ : lastTime($P_e^i$) do
        $P_e^{t,i}$ = extractPathSeg($P_e^i$, $t$, lastTime($P_e^i$));
        $[\mathcal{F}^{t,i}, V^{t,i}]$ = simPerception($\mathcal{H}^i$, firstTime($P_e^i$), $t$);
for $j = 1 \ldots K$ do
    $U_+ = U_- = 0$;
    foreach $P_e^{t,i}$ do
        $M^{t,i}$ = buildCostmap($C_{j-1}$, $\mathcal{F}^{t,i}$, $V^{t,i}$, $C_{\bar{V}}$);
        $P_*^{t,i}$ = planLossAugPath(start($P_e^{t,i}$), goal($P_e^{t,i}$), $M^{t,i}$);
        foreach $x \in P_e^{t,i} \cap V^{t,i}$ do
            $U_-(F_x^{t,i}) = U_-(F_x^{t,i}) + 1$;
        foreach $x \in P_*^{t,i} \cap V^{t,i}$ do
            $U_+(F_x^{t,i}) = U_+(F_x^{t,i}) + 1$;
    $T_f = T_o = T_w = \emptyset$;
    $U = U_+ - U_-$;
    foreach $F_x^{t,i}$ such that $U(F_x^{t,i}) \ne 0$ do
        $T_f = T_f \cup F_x^{t,i}$;
        $T_o = T_o \cup \mathrm{sgn}(U(F_x^{t,i}))$;
        $T_w = T_w \cup |U(F_x^{t,i})|$;
    $R_j$ = trainWeightedRegressor($T_f$, $T_o$, $T_w$);
    $C_j = C_{j-1} * e^{\eta_j R_j}$;
return $C_K$

4.2  Imperfect and Inconsistent Demonstration

The MMP framework implicitly assumes that one or more cost
functions exist under which demonstrated behavior is near optimal.
Generally this is not the case, as there will always be noise in
human demonstration. Further, multiple examples possibly collected
from different environments and different experts may be
inconsistent with each other (due to inconsistency in human
behavior, a different concept of what is desirable, or an
incomplete perceptual description of the environment by the robot).
Finally, sometimes experts are flat out wrong, and demonstrate
behavior that is not even close to desirable.

While the MMP framework is robust to poor training data, it does suffer
degraded overall performance and generalization, in the same way that
supervised classification performance is degraded by noisy or mislabeled
training data. Attempting to have an expert sanitize the training input after
initial demonstration is disadvantageous for two reasons. First, it creates an
additional need for human involvement, eliminating some of the time savings
promised by this approach. Second, it assumes that an expert can detect all
errors; while this may be true for extreme cases, a human expert may be no more
capable of identifying small amounts of noise than he is of preventing that
noise in the first place. Even if detecting and filtering out noisy
demonstration is automated (as in Section 4.2.3), removing all imperfect
demonstration would remove a large percentage of available training data. This
would greatly increase the amount of effort that must be expended to produce a
viable training set; it may also remove example demonstrations from which
something could still have been learned. Therefore, a practical and robust
learning approach must be able to handle a reasonable amount of error in
provided demonstration without significantly degraded performance. The rest of
this section describes modifications to the LEARCH algorithm that can increase
robustness and improve generalization in the face of noisy or poor expert
demonstration.

Algorithm 4: The dynamic LEARCH algorithm with example replanning and weight
balancing

Inputs: Example Behaviors P_e^1, P_e^2, ..., P_e^n,
        Sensor Histories H^1, H^2, ..., H^n,
        Cost Map C_{\bar V}, Corridor width \beta

C_0 = 1;
foreach P_e^i do
    for \tau = firstTime(P_e^i) : \Delta\tau : lastTime(P_e^i) do
        P_e^{\tau,i} = extractPathSeg(P_e^i, \tau, lastTime(P_e^i));
        [F^{\tau,i}, V^{\tau,i}] = simPerception(H^i, firstTime(P_e^i), \tau);
for j = 1...K do
    U_+ = U_- = 0;
    foreach P_e^{t,i} do
        M^{t,i} = buildCostmap(C_{j-1}, F^{t,i}, V^{t,i}, C_{\bar V});
        P_*^{t,i} = planLossAugPath(start(P_e^{t,i}), goal(P_e^{t,i}), M^{t,i});
        M_N^{t,i} = buildCorridorCostmap(M^{t,i}, \beta, P_e^{t,i});
        P_e^{*t,i} = replanExample(start(P_e^{t,i}), goal(P_e^{t,i}), M_N^{t,i});
        foreach x \in P_e^{*t,i} \cap V^{t,i} do
            U_-(F_x^{t,i}) = U_-(F_x^{t,i}) + 1;
        foreach x \in P_*^{t,i} \cap V^{t,i} do
            U_+(F_x^{t,i}) = U_+(F_x^{t,i}) + 1;
    T_f = T_o = T_w = \emptyset;
    U = U_+ - U_-;
    N_+ = N_- = 0;
    foreach F_x^{t,i} such that U(F_x^{t,i}) \neq 0 do
        T_f = T_f \cup F_x^{t,i};
        T_o = T_o \cup sgn(U(F_x^{t,i}));
        T_w = T_w \cup |U(F_x^{t,i})|;
        if sgn(U(F_x^{t,i})) > 0 then N_+ = N_+ + 1
        else N_- = N_- + 1;
    foreach (t_o, t_w) \in (T_o, T_w) do
        if t_o > 0 then t_w = t_w / N_+
        else t_w = t_w / N_-;
    R_j = trainWeightedRegressor(T_f, T_o, T_w);
    C_j = C_{j-1} * e^{\eta_j R_j};
return C_K

4.2.1  Unachievable  Example  Behaviors 

Experts  do  not  necessarily  plan  their  example  behav¬ 
ior  in  a  manner  consistent  with  a  robot’s  planning 
system:  this  assumption  is  not  part  of  the  MMP 
framework.  However,  what  is  assumed  is  that  there 
exists  at  least  one  allowable  cost  function  that  will 
cause  the  robot’s  planner  to  reproduce  demonstrated 
behavior  (by  scoring  said  behavior  as  the  minimum 
cost  plan).  Unfortunately,  this  is  not  always  the 
case:  it  is  possible  for  an  example  to  be  unachiev¬ 
able.  An  unachievable  example  is  one  such  that  no 
consistent  cost  function,  when  applied  to  the  avail¬ 
able  perceptual  feature  representation,  will  result  in 
the  specified  planning  system  reproducing  the  exam¬ 
ple  demonstration.  For  example,  an  expert  may  give 
an  inconsistently  wide  berth  to  obstacles,  or  make 
wider  turns  than  are  necessary.  Perhaps  the  most  in¬ 
tuitive  case  is  if  an  expert  turns  left  around  a  large 
obstacle,  when  turning  right  would  have  been  slightly 
shorter.  The  result  is  that  experts  often  take  slightly 
longer routes through similar terrain than are optimal [Silver et al., 2008]; depending on planner details
(such  as  c-space  expansion  and  dynamic  constraints) 
such  examples  are  often  unachievable. 

It is instructive to observe what happens to the functional gradient with an
unachievable example. Imagine a section of an environment where all states are
described by the identical feature vector $F'$. Under this scenario, (10)
reduces to

\[
\nabla \mathcal{O}_F[C] =
\begin{cases}
\sum_{x \in P_e} 1 \; - \sum_{x \in P_*} 1 & \text{if } F = F' \\
0 & \text{if } F \neq F'
\end{cases}
\]

The functional gradient depends only on the lengths of the example and current
plan, independent of the cost function. If the paths are not of equal length,
then the optimization will never be satisfied. Specifically, if the example
path is too long, there will always be an extra component of the gradient that
attempts to lower costs at $F'$. Intuitively, an unachievable example implies
that the cost of certain terrain should be 0, as this would result in any path
through that region being optimal. However, since costs are constrained to
$\mathbb{R}^+$, this will never be achieved. Instead an unachievable example
will have the effect of unnecessarily lowering costs over a large section of
the feature space, and artificially reducing dynamic range. Depending on the
expressiveness of $\mathcal{R}$, an unachievable example counteracts the
constraints of other (achievable) paths, resulting in poorer performance and
generalization (see Figure 7).

[Figure 7 panels: (a) 3 Example Training Paths; (b) Learned costmap with
unbalanced weighting; (c) Learned costmap with balanced weighting; (d) Ratio of
the balanced to unbalanced costmaps]

Figure 7: The red path is an unachievable example path, as it will be less
expensive under any cost function to cut more directly across the grass. With
standard unbalanced weighting (b), the unachievable example forces down the
cost of grass, and prevents the blue example from being achieved. Balanced
weighting (c) prevents this bias, and the blue example is achieved. Overall,
grass is approximately 12% higher cost with balanced than unbalanced
weighting (d).
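The length-only dependence of the gradient can be checked numerically. The sketch below is illustrative only: it assumes a hypothetical one-feature world in which every state shares the feature vector F', a fixed planned path (a uniform cost never changes which path is shortest), and a multiplicative update standing in for exponentiated functional gradient descent. The function names and the step size are assumptions, not part of the original system.

```python
import math


def uniform_feature_gradient(example_len: int, planned_len: int) -> int:
    """Gradient at the shared feature vector F'.

    Every example state contributes +1 and every planned state contributes -1,
    so the value depends only on the two path lengths, never on the cost.
    """
    return example_len - planned_len


def run_descent(example_len, planned_len, cost=1.0, step=0.1, iters=50):
    """Apply a multiplicative descent step to the single cost value.

    A positive gradient (example longer than plan) shrinks the cost at every
    iteration; the cost stays positive, so the update never terminates.
    """
    grad = uniform_feature_gradient(example_len, planned_len)
    history = [cost]
    for _ in range(iters):
        cost *= math.exp(-step * grad)
        history.append(cost)
    return history
```

With an example two cells longer than the plan, `run_descent(12, 10)` yields a strictly decreasing cost sequence that approaches but never reaches zero, which is the loss of dynamic range described above; with equal lengths the cost is left unchanged.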

This  negative  effect  can  be  avoided  by  performing 
a  slightly  different  regression  or  classification  when 
projecting  the  gradient.  Instead  of  minimizing  the 
weighted  error,  the  balanced  weighted  error  is  min¬ 
imized;  that  is,  both  positive  and  negative  targets 
make an equal contribution. Formally, in (19) $R_*$ is replaced with $R_*^B$
defined as

\[
R_*^B = \arg\max_R \left[ \sum_t \sum_{y_x^t > 0} \frac{\alpha_x^t R(F_x^t)}{N_+}
      \; - \sum_t \sum_{y_x^t < 0} \frac{\alpha_x^t R(F_x^t)}{N_-} \right],
\qquad
N_+ = \sum_t \sum_{y_x^t > 0} \alpha_x^t = |U_+|_1, \quad
N_- = \sum_t \sum_{y_x^t < 0} \alpha_x^t = |U_-|_1
\tag{20}
\]

In the extreme unachievable case described above, $R_*^B$ will be zero
everywhere; the optimization will be satisfied with the cost function as is.
The effect of balancing in the general case can be observed by rewriting the
regression operation in terms of the planned and example visitation counts, and
observing the correlation of their inputs.


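Theorem 4.1. The regression targets of $R_*$ and $R_*^B$ are always correlated,
except when the visitation counts between the example and planned path are
perfectly correlated.

Proof. In terms of visitation counts, the two projections are

\[
R_* = \arg\max_R \langle R,\; U_+ - U_- \rangle, \qquad
R_*^B = \arg\max_R \left\langle R,\; \frac{U_+}{N_+} - \frac{U_-}{N_-} \right\rangle
\]

and the inner product of their inputs is

\[
\left\langle U_+ - U_-,\; \frac{U_+}{N_+} - \frac{U_-}{N_-} \right\rangle
= \frac{\langle U_+, U_+ \rangle}{N_+} - \frac{\langle U_+, U_- \rangle}{N_+}
  - \frac{\langle U_+, U_- \rangle}{N_-} + \frac{\langle U_-, U_- \rangle}{N_-}
= \frac{|U_+|^2}{N_+} + \frac{|U_-|^2}{N_-}
  - \left( \frac{1}{N_+} + \frac{1}{N_-} \right) \langle U_+, U_- \rangle
\tag{21}
\]

By the Cauchy-Schwarz inequality, $\langle U_+, U_- \rangle$ is bounded by
$|U_+||U_-|$, and is only tight against this bound when the visitation counts
are perfectly correlated, which implies

\[
\langle U_+, U_- \rangle = |U_+||U_-| \;\Longrightarrow\; U_- = kU_+
\;\Longrightarrow\; |U_-| = k|U_+|, \quad N_- = kN_+
\]

for some scalar $k$. By substitution

\[
\begin{aligned}
\frac{|U_+|^2}{N_+} + \frac{|U_-|^2}{N_-}
  - \left( \frac{1}{N_+} + \frac{1}{N_-} \right) \langle U_+, U_- \rangle
&\geq \frac{|U_+|^2}{N_+} + \frac{|U_-|^2}{N_-}
  - \left( \frac{1}{N_+} + \frac{1}{N_-} \right) |U_+||U_-| \\
&= \frac{|U_+|^2}{N_+} + \frac{k^2 |U_+|^2}{k N_+}
  - \left( \frac{1}{N_+} + \frac{1}{k N_+} \right) k |U_+||U_+| \\
&= \frac{|U_+|^2}{N_+} + \frac{k |U_+|^2}{N_+}
  - \frac{k |U_+|^2}{N_+} - \frac{|U_+|^2}{N_+} \\
&= 0
\end{aligned}
\]

When $\langle U_+, U_- \rangle$ is not tight against the upper bound

\[
\left\langle U_+ - U_-,\; \frac{U_+}{N_+} - \frac{U_-}{N_-} \right\rangle > 0
\]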
By (21) the similarity between inputs to the projections is negatively
correlated to the overlap of the positive and negative visitation counts. When
there exists clear differentiation between what features should have their
costs increased and decreased, the projection inputs will be similar. As the
example and current planned behaviors travel over increasingly similar terrain,
the inputs begin to diverge; the contribution of the balanced projection to the
current cost function will level out, while that of the unbalanced projection
will increase in the direction of the longer path. Finally, in a fully
unachievable case, the balanced projection will zero out, while the unbalanced
would drive the cost in the direction of the more dominant class. This effect
is observed empirically in Section 5. The implementation of this balancing is
shown in Algorithm 4.
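The balancing step can be sketched as a small preprocessing function applied to the visitation counts before the regressor is trained. This is a minimal sketch under stated assumptions: visitation counts are given as arrays indexed by a discretized feature bin, the function name `regression_targets` is hypothetical, and the class totals are taken as summed weights in the manner of (20).

```python
import numpy as np


def regression_targets(u_plus, u_minus, balanced=True):
    """Build (indices, targets, weights) for the gradient projection.

    u_plus  : visitation counts of the current planned paths
    u_minus : visitation counts of the example paths
    Targets are the sign of the count difference; weights are its magnitude.
    With balanced=True each class of weights is rescaled to sum to one, so
    positive and negative targets contribute equally to the regression.
    """
    u = np.asarray(u_plus, dtype=float) - np.asarray(u_minus, dtype=float)
    keep = u != 0
    targets = np.sign(u[keep])
    weights = np.abs(u[keep])
    if balanced:
        pos = targets > 0
        n_pos = weights[pos].sum()
        n_neg = weights[~pos].sum()
        if n_pos > 0:
            weights[pos] /= n_pos
        if n_neg > 0:
            weights[~pos] /= n_neg
    return np.flatnonzero(keep), targets, weights
```

For example, counts `u_plus = [3, 0, 1, 0]` and `u_minus = [0, 1, 0, 1]` give unbalanced weights `[3, 1, 1, 1]`, where the positive class dominates; after balancing, each class sums to one.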


4.2.2  Noisy  Demonstration:  Replanning  and 
Corridor  Constraints 


A balanced regression can help to account for large scale sub-optimality in
human demonstration. However, sub-optimality can also occur at a smaller scale.
It is unreasonable to ever expect a human to drive or demonstrate the exact
perfect path; it is often the case that a plan that travels through neighboring
or nearby states would be a slightly better example. In some cases this example
noise translates to noise in the cost function; in more extreme cases it can
significantly affect performance (Figure 8). What is needed is an approach that
smoothes out small scale noise in expert demonstration, producing a better
training example.

Such a smoothed example can be derived from expert demonstration by redefining
the MMP constraint: instead of example behavior being interpreted as the exact
optimal behavior, it can be interpreted as a behavior that is spatially near
the optimal path. The exact definition of 'close' depends on the state space;
the loss function will always provide at least one possible metric. If the
state space is $\mathbb{R}^n$, then Euclidean distance is a natural metric.
Therefore, rather than an example defining the exact optimal path, it would
define a corridor in which the optimal path exists.

Redefining the original MMP constraint in (4) in this way yields

\[
\text{minimize } \mathcal{O}[C] = \lambda\,\mathrm{REG}(C) \tag{22}
\]

subject to the constraint

\[
\sum_{x \in P_*} \big( C(F_x) - L_e(x) \big) \;\geq\; \sum_{x \in P_e^*} C(F_x)
\]
\[
P_* = \arg\min_{P} \sum_{x \in P} \big( C(F_x) - L_e(x) \big), \qquad
P_e^* = \arg\min_{P \in \mathcal{N}_e} \sum_{x \in P} C(F_x)
\]

Instead of enforcing that $P_e$ is optimal, the new constraint is to enforce
that $P_e^*$ is optimal, where $P_e^*$ is the optimal path within some set of
paths $\mathcal{N}_e$. The definition of $\mathcal{N}_e$ determines how 'close'
is defined. Using the above example of a corridor in a Euclidean space,
$\mathcal{N}_e$ would be defined as

\[
\mathcal{N}_e = \{ P \mid \forall x \in P \;\; \exists y \in P_e \text{ s.t. } \|x - y\| \leq \beta \}
\]

with $\beta$ defining the corridor width. In the general case, this definition
can always be rewritten in terms of the loss function

\[
\mathcal{N}_e = \{ P \mid \forall x \in P \;\; \exists y \in P_e \text{ s.t. } L(x, y) \leq \beta \}
\]


It  is  important  to  note  that  the  definition  of  Ne  is 
only  dependent  on  individual  states.  Therefore,  P* 
can  be  found  by  an  optimal  planner,  simply  by  only 
allowing  traversal  through  states  that  meet  the  loss 
threshold  (3  with  respect  to  some  state  in  Pe. 
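Because membership in the corridor is a per-state test, replanning reduces to an ordinary shortest-path search over a masked grid. The following is a minimal sketch under stated assumptions: a 2D cost grid, a 4-connected planner that pays the cost of every cell on the path, Euclidean distance as the closeness measure, and an example path that is itself 4-connected (so the goal is always reachable inside the corridor). The names `corridor_mask` and `replan_example` are hypothetical.

```python
import heapq

import numpy as np


def corridor_mask(shape, example, beta):
    """Boolean grid: True for cells within distance beta of some example cell."""
    rows, cols = np.indices(shape)
    mask = np.zeros(shape, dtype=bool)
    for r, c in example:
        mask |= (rows - r) ** 2 + (cols - c) ** 2 <= beta ** 2
    return mask


def replan_example(cost, example, beta):
    """Dijkstra search from example[0] to example[-1] restricted to the corridor."""
    allowed = corridor_mask(cost.shape, example, beta)
    start, goal = tuple(example[0]), tuple(example[-1])
    best = {start: cost[start]}
    parent = {}
    queue = [(cost[start], start)]
    while queue:
        dist, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if dist > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < cost.shape[0] and 0 <= nxt[1] < cost.shape[1]
            if not inside or not allowed[nxt]:
                continue
            new_dist = dist + cost[nxt]
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                parent[nxt] = cell
                heapq.heappush(queue, (new_dist, nxt))
    # Walk parent links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]
```

With a corridor width of zero the search can only return the example itself; widening the corridor lets the replanned example step around locally expensive cells while keeping the same start and goal.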

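Reformulating (22) as an optimization problem yields the following objective

\[
\text{minimize } \mathcal{O}[C] = \lambda\,\mathrm{REG}(C)
  + \min_{P_e^* \in \mathcal{N}_e} \sum_{x \in P_e^*} C(F_x)
  - \min_{P} \sum_{x \in P} \big( C(F_x) - L_e(x) \big)
\tag{23}
\]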


The resulting change in the LEARCH algorithm is to carry through the extra
minimization to the computation of the visitation counts. That is, at every
iteration, a new, smoothed example is chosen from within $\mathcal{N}_e$;
example visitation counts are computed with respect to this path. This new step
is shown in Algorithm 4.

It should be noted that as a result of this additional min term, the objective
is no longer convex. It is certainly possible to produce individual examples
where such a smoothing step can result in poor local minima; however, it has
been observed empirically that this effect is neutralized when using multiple
examples. The experimental performance of this smoothing step is presented in
Section 5.

[Figure 8 panels: (a) Example Paths; (b) Planned Paths (No Replanning);
(c) Planned Paths (With Replanning)]

Figure 8: An example of how noisy demonstration can hurt performance. The red
and green example paths in (a) are slightly too close to trees, preventing the
cost of the trees from increasing sufficiently to match the red example
(map (b)). However, if the paths are allowed to be replanned within a corridor,
the red and green path are essentially smoothed, allowing the cost of trees to
get sufficiently high (map (c)). On average, the trees achieve three times the
cost in (c) as in (b).

When operating with dynamic and partially unknown perceptual data, this
replanning step provides another important side effect. Rewriting (17) with
this additional min term yields

\[
\begin{aligned}
\text{minimize } \mathcal{O}[C] = {} & \lambda\,\mathrm{REG}(C) \\
& + \sum_t \min_{P_e^{*t} \in \mathcal{N}_e^t}
    \Bigg( \sum_{x \in P_e^{*t} \cap V^t} C(F_x^t)
         + \sum_{x \in P_e^{*t} \cap \bar{V}^t} C_{\bar V}(x) \Bigg) \\
& - \sum_t \min_{P^t}
    \Bigg( \sum_{x \in P^t \cap V^t} \big( C(F_x^t) - L_e^t(x) \big)
         + \sum_{x \in P^t \cap \bar{V}^t} C_{\bar V}(x) \Bigg)
\end{aligned}
\tag{24}
\]

where $\mathcal{N}_e^t$ is the set of paths near $P_e^t$. However, as before it
should be noted that the $C_{\bar V}$ terms will have no effect on the
functional gradient. Therefore, the definition of $\mathcal{N}_e^t$ does not
need to consider states in $\bar{V}$. This yields the following general
definition of $\mathcal{N}_e^t$

\[
\mathcal{N}_e^t = \{ P^t \mid \forall x \in P^t \cap V^t \;\;
  \exists y \in P_e^t \cap V^t \text{ s.t. } L(x, y) \leq \beta \}
\]

The result is that $\mathcal{N}_e^t$ only defines closeness over $V^t$.
Behavior outside of $V^t$ does not directly affect the gradient, but does
affect the objective value (the difference in cost between the current
(replanned) example and planned behavior). Therefore, by performing a
replanning step (even with $\beta = 0$), example behavior can be made
consistent with $C_{\bar V}$ without compromising its effectiveness as an
example within $V^t$. This notion of consistency proves to have meaningful
value.


Figure 9: Recorded example behavior from time t (left) and t + 1 (right),
overlaid on a single perceptual feature (obstacle height). Future behavior is
inconsistent at time t, but makes sense at time t + 1 given additional
perceptual data.


4.2.3  Filtering  for  Inconsistent  Examples 

One  fundamental  issue  with  expert  demonstration  is 
consistency.  A  human  demonstrator  may  act  approx¬ 
imately  according  to  one  metric  during  one  example, 
and  a  slightly  different  metric  during  another  exam¬ 
ple.  While  each  individual  example  may  be  near- 
optimal  with  respect  to  some  metric,  the  two  exam¬ 
ples  together  may  be  inconsistent;  that  is,  there  is 
no  consistent  cost  function  that  would  define  both 
behaviors  as  optimal. 

The  possibility  of  an  expert  interpreting  unknown 
terrain  in  a  different  manner  is  a  potentially  large 
source  of  inconsistency.  This  is  especially  true  when 
attempting  to  learn  an  online  cost  function,  as  it  is 
very  likely  that  the  demonstrator  will  have  implicit 
prior  knowledge  of  the  environment  that  is  unavail¬ 
able  to  the  perception  system.  However,  by  always 
performing  a  replanning  step  as  previously  discussed, 
example  demonstration  can  be  made  consistent  with 
the robot’s interpretation of the environment in unobserved regions.

With  consistency  in  unobserved  regions  accounted 
for,  there  remain  four  primary  sources  of  inconsistent 
demonstration 

•  Inconsistency  between  multiple  experts 

•  Expert  error  (poor  demonstration) 

•  Inconsistency  between  an  expert’s  and  the 
robot’s  perception  in  observed  regions 

•  A  mismatch  between  an  expert’s  planned  and 
actual  behavior 

This last issue was alluded to in Section 4.1: while an expert example should
consist of an expert’s plan at time t from the current state to the goal, what
is recorded is the expert’s behavior from time t to the goal. Figure 9 provides
a simple example of this mismatch: at time t, the expert likely planned to
drive straight, but was forced to replan at time t + 1 when the cul-de-sac was
observed. This breaks the assumption that the expert behavior from time t
onward matches the expert plan; the result is that the discretized example at
time t is inconsistent with other example timesteps.

However, the very inconsistency of such timesteps provides a basis for their
filtering and removal. Specifically, it can be assumed that a human expert will
plan in a fairly consistent manner during a single example traverse (see
footnote 7). If the behavior from a single timestep or small set of timesteps
is inconsistent with the demonstrated behavior at other timesteps, then it is
safe to assume that this small set of timesteps does not demonstrate correct
behavior, and can be filtered out and removed as training examples. This does
not significantly affect the amount of training data required to train a full
system, as by definition an inconsistent timestep is unlikely to provide an
example of an important concept.

Inconsistency can be quantitatively defined by observing each timestep’s
contribution to the objective functional (its slack penalty). In (5) this
penalty is explicitly defined as a measurement of by how much a constraint
remains violated. If the penalty at a single timestep of an example behavior is
a statistical outlier from the distribution of slack penalties at all other
timesteps, it indicates that single timestep implies constraints that remain
violated far more than others. That is, the constraints at an outlier timestep
are inconsistent with those implied by the rest of a demonstration.

7 If this assumption does not hold, then the very idea of learning from said
expert’s demonstration is flawed.


Therefore,  the  following  filtering  heuristic  is  pro¬ 
posed  as  a  pre-processing  step.  First,  attempt  to 
learn  a  cost  function  over  all  timesteps  of  a  single 
example  behavior  and  identify  statistical  outliers  (ac¬ 
cording  to  slack  penalties).  During  this  step,  a  more 
complex  hypothesis  space  of  cost  functions  should 
be  used  than  is  intended  for  the  final  cost  function 
(i.e., use more complex regressors). As these outlier
timesteps  are  inconsistent  within  an  overly  complex 
hypothesis  space,  there  is  evidence  that  the  incon¬ 
sistency  is  in  the  example  itself,  and  not  for  lack  of 
expressiveness  in  the  cost  function.  Therefore,  these 
timesteps  should  be  removed.  This  process  can  be 
repeated  for  each  example  behavior,  with  only  re¬ 
maining  timesteps  used  in  the  final  training. 
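The outlier test itself is not specified here, so the sketch below should be read as one possible choice rather than the method used on Crusher. It assumes the per-timestep slack penalties of a single example have already been computed by a preliminary learning run, and flags a timestep when its penalty exceeds the median by more than k scaled median absolute deviations. The function name, the robust statistic, and the default k = 3 are all assumptions.

```python
import numpy as np


def filter_outlier_timesteps(slack, k=3.0):
    """Return indices of timesteps to keep for the final training run.

    slack : per-timestep slack penalties from a preliminary learning run
    k     : number of scaled MADs above the median tolerated before a
            timestep is treated as inconsistent (assumed threshold)
    Only unusually large penalties are removed; small ones are always kept.
    """
    slack = np.asarray(slack, dtype=float)
    median = np.median(slack)
    # 1.4826 rescales the MAD to match a standard deviation for Gaussian data.
    scale = 1.4826 * np.median(np.abs(slack - median))
    keep = slack <= median + k * scale
    return np.flatnonzero(keep)
```

Applied to penalties `[1.0, 1.2, 0.9, 1.1, 9.0, 1.0]`, only the fifth timestep is dropped; the remaining indices would then be used in the final training pass.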

Aside  from  filtering  out  inconsistency  due  to 
plan/behavior  mismatch,  this  approach  will  also  fil¬ 
ter  timesteps  due  to  other  sources  of  inconsistency. 
This  is  beneficial,  as  long  as  the  timesteps  truly  are 
inconsistent.  However,  the  possibility  always  remains 
that  the  example  itself  was  correct;  it  may  only  ap¬ 
pear  inconsistent  due  to  the  fidelity  of  perception  or 
planning.  In  this  case,  filtering  is  still  beneficial,  as 
the  examples  would  not  have  been  learnable  (with 
the  current  set  of  perceptual  features  and  the  current 
planning  system);  instead,  the  small  subset  of  filtered 
examples  can  be  examined  by  a  system  expert,  who 
may  then  identify  a  necessary  additional  component 
level  capability.  Experimental  results  of  this  filtering 
approach  are  presented  in  Section  5. 

4.3  Application  to  Mobile  Robotic 
Systems 

Before  LEARCH  (in  either  its  static  or  dynamic 
forms)  can  be  applied  to  the  task  of  learning  a  ter¬ 
rain  cost  function  for  a  mobile  robotic  system,  there 
are  still  some  practical  considerations  to  address.  It 
is  important  to  remember  the  specific  task  for  which 
LEARCH  is  intended:  it  is  designed  to  select  a  cost 
function  from  a  defined  hypothesis  space  C  ,  such  that 
expert  demonstration  is  recreated  when  the  cost  func¬ 
tion  is  applied  to  the  specific  perception  and  planning 
systems  for  which  it  was  trained.  There  are  several 
hidden  challenges  in  that  statement,  such  as  defining 
C,  and  ensuring  LEARCH  is  producing  a  cost  func¬ 
tion  for  the  correct  planning  system. 

4.3.1  Selecting  a  Cost  Function  Hypothesis 
Space 

The cost function hypothesis space C is implicitly defined by the regressor
space $\mathcal{R}$. In turn, $\mathcal{R}$ is defined by design choices
relating to the family and allowable complexity of regressors. For example, if
single layer neural networks with at most H hidden units are chosen as the
class of regressors, then C consists of all cost functions that are a weighted
sum of such networks. In this way, cost functions of almost arbitrary
complexity can be allowed.

Figure 10: Example of a new feature (right) learned automatically from
panchromatic imagery (left) using only expert demonstration (there are no
explicit class labels).

However,  as  with  most  machine  learning  tech¬ 
niques,  there  is  a  design  tradeoff  between  expressive¬ 
ness  and  generalization.  Complex  regressors  are  ca¬ 
pable  of  expressing  complex  costing  rules,  but  are 
more  prone  to  overfitting.  In  contrast,  simpler  re¬ 
gressors  may  generalize  better,  but  are  also  limited 
in  the  costing  rules  they  can  express.  This  tradeoff 
must  be  effectively  balanced  to  ensure  both  sufficient 
expressiveness  and  generalization.  Fortunately,  as  re¬ 
gression  is  a  core  machine  learning  task,  there  are  well 
developed  and  understood  approaches  for  achieving 
this  balance.  In  practice,  validation  with  an  inde¬ 
pendent  holdout  set  of  demonstrated  behaviors  can 
help  quantify  this  tradeoff.  This  allows  a  range  of 
regressor  types  and  complexities  to  be  automatically 
evaluated;  the  one  with  the  best  holdout  performance 
can  be  selected  with  minimal  additional  human  inter¬ 
action. 

Another  issue  with  respect  to  the  definition  of  C  is 
computational  cost.  In  a  scenario  where  perceptual 
features  are  static,  this  concern  is  not  as  important, 
as  cost  evaluations  are  only  performed  once.  How¬ 
ever,  in  an  online  and  dynamic  setting,  cost  eval¬ 
uations  are  performed  continuously;  computational 
complexity  may  be  a  much  larger  issue.  As  the  final 
learned  cost  function  is  a  weighted  combination  of  K 
regressors,  computational  complexity  of  the  final  cost 
function  is  linear  in  K.  Again,  this  creates  a  design 
tradeoff;  limiting  K  will  limit  the  computational  cost 
per evaluation, but will also limit the accuracy of the cost function (as the
final steps of a gradient descent operation fine tune the solution).

One solution is to define $\mathcal{R}$ as the space of linear functions. Since
the weighted sum of linear functions is another linear function, using this
definition of $\mathcal{R}$ would result in the final complexity of $C_K$ being
constant with respect to K. However, using linear regressors with LEARCH
results in almost the same solution as would be produced by the linear MMP
algorithm (but not identical due to exponentiated functional gradient descent).
As the fundamental advantage of the LEARCH approach was to allow non-linear
cost functions, this would seem to imply a necessary tradeoff. However, the
addition of a feature learning phase [Ratliff et al., 2007, Silver et al.,
2008, Silver et al., 2009a] to LEARCH can potentially solve this problem.
During most learning iterations, linear regressors are used. However, when it
is detected that the objective error is no longer decreasing, a single, simple
non-linear regression is performed. This single, non-linear step would better
discriminate between terrains that are difficult for a purely linear regressor.
However, rather than add this new regressor directly into the cost function, it
is instead treated as a new feature. Figure 10 provides a simple example of a
learned feature with only greyscale imagery as input.
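The constant evaluation cost of the linear case follows directly from the exponentiated update: with C_0 = 1, the product of K exponentiated linear regressors is the exponential of a single linear function. A minimal sketch, assuming each regressor is stored as a weight vector and bias and each step has a scalar step size (the function names are hypothetical):

```python
import numpy as np


def collapse_linear_regressors(weights, biases, step_sizes):
    """Fold K linear regressors into one weight vector and one bias.

    weights    : (K, D) array, one row per regressor R_j(F) = w_j . F + b_j
    biases     : (K,) array of b_j
    step_sizes : (K,) array of eta_j
    Returns (w, b) such that exp(F . w + b) equals prod_j exp(eta_j * R_j(F)).
    """
    weights = np.asarray(weights, dtype=float)
    biases = np.asarray(biases, dtype=float)
    eta = np.asarray(step_sizes, dtype=float)
    return eta @ weights, float(eta @ biases)


def evaluate_cost(features, w, b):
    """Cost of each feature row under the collapsed log-linear cost function."""
    return np.exp(np.asarray(features, dtype=float) @ w + b)
```

Evaluating the collapsed form costs one dot product per cell regardless of K, whereas non-linear regressors would have to be evaluated one by one.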


4.3.2  Planner  Interpretation  of  Cost  Maps 

A common approach in mobile robotics is to treat the environment as a 2.5D
space; this allows the state space S for high level path planning to be
$\mathbb{R}^2$. This results in terrain costs defined over a discretized 2D
grid. However, different planning systems may interpret a 2D cost grid
differently. Since the goal is to learn a cost function to recreate behavior
with a specific planning system, these details must be taken into account to
learn the cost function correctly.

Perhaps the simplest cost-aware planner one might use for a mobile robot would
be 4-connected A* (or other grid planner). Such a planning system would incur a
cost of $C(x)$ whenever a cell x was traversed. Now consider the case of an
8-connected A*. Many 8-connected implementations treat traversing a cell
diagonally as higher cost than traversing the same cell axis-aligned; this
allows the planner to take into account the extra distance traversed through a
grid. This usually takes the form of incurring a cost of $C(x)$ when traversing
the cell axis-aligned, and a cost of $\sqrt{2}\,C(x)$ when traversing the cell
diagonally.

Since cells traversed diagonally incur more cost, this must be taken into
account by LEARCH when computing the projection of the functional gradient.
With respect to the final cost of a path, a cell traversed diagonally will have
a $\sqrt{2}$ greater effect than one traversed axis-aligned; therefore it is
$\sqrt{2}$ times more important to get the sign right on the projection. The
solution is to increment the visitation count of a cell by 1 when traversing it
axis-aligned, and by $\sqrt{2}$ when traversing diagonally. In the general
case, a planner may incur a cost of $d\,C(x)$ when traversing a distance d
through cell x; the visitation count of state x should then be incremented
by d.
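A distance-weighted visitation count for an 8-connected grid path can be sketched as follows. The convention that the traversal distance is charged to the cell being entered (so the start cell receives no count) is an assumption made for illustration; the counts should follow whatever convention the actual planner uses when it accumulates cost. The function name is hypothetical.

```python
import math
from collections import defaultdict


def visitation_counts(path):
    """Distance-weighted visitation counts for a grid path.

    path : list of (row, col) cells, each adjacent (8-connected) to the last
    Each move adds its length to the entered cell: 1 for an axis-aligned
    step and sqrt(2) for a diagonal step.
    """
    counts = defaultdict(float)
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        counts[(r1, c1)] += math.hypot(r1 - r0, c1 - c0)
    return dict(counts)
```

For the path `[(0, 0), (0, 1), (1, 2), (2, 2)]`, the cell entered diagonally, `(1, 2)`, receives a count of sqrt(2) while the other two entered cells receive 1.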

Another issue that must be considered is that of configuration space expansion.
Motion planners for mobile robotic systems often apply a c-space expansion to
an input cost map before generating a plan, in order to account for the
physical dimensions of the robot. The result is that the cost the planner
assigns for traversing distance d through state x is no longer $d\,C(x)$, but
rather something along the lines of

\[
d \sum_{y \in \mathcal{N}} W(x, y)\, C(y) \tag{25}
\]

where $\mathcal{N}$ is a set of states sufficiently close to x, and W is a
weighting function. Common choices for W include a constant, or a linear
falloff based on $\|x - y\|$. As before, this weighting must be captured by the
visitation counts: if distance d is traversed through cell x, then all cells
$y \in \mathcal{N}$ must have their visitation counts incremented by
$d\,W(x, y)$. A further complication arises if W depends not only on the
locations of x and y, but also on their (or other states') cost values. For
instance, if a c-space expansion defined the cost of traversing d through
state x as

\[
d \max_{y \in \mathcal{N}} C(y)
\]

then only the state y with the maximum cost in $\mathcal{N}$ should have its
visitation count incremented by d; all other states would not affect the
planner perceived cost of traversing x under the current C. Unlike (25), this
form results in non-convexity in the LEARCH algorithm (in addition to
non-convexity from replanning), as different initial cost functions $C_0$ may
produce significantly different results.
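Both expansion styles can be mirrored in the visitation counts with a few lines of code. The sketch below assumes a 2D grid, a circular neighborhood of integer radius around each traversed cell, and a constant weighting W = 1 by default; the helper names and these choices are illustrative assumptions rather than a description of any particular planner.

```python
import numpy as np


def neighborhood(shape, cell, radius):
    """Grid cells within Euclidean distance `radius` of `cell`."""
    r, c = cell
    cells = []
    for rr in range(max(0, r - radius), min(shape[0], r + radius + 1)):
        for cc in range(max(0, c - radius), min(shape[1], c + radius + 1)):
            if (rr - r) ** 2 + (cc - c) ** 2 <= radius ** 2:
                cells.append((rr, cc))
    return cells


def counts_weighted_sum(shape, traversals, radius, weight=lambda x, y: 1.0):
    """Counts for a weighted-sum expansion as in (25).

    traversals : list of (cell, d) pairs, d being the distance traversed
    Every neighbor y of a traversed cell x is incremented by d * W(x, y).
    """
    counts = np.zeros(shape)
    for cell, d in traversals:
        for y in neighborhood(shape, cell, radius):
            counts[y] += d * weight(cell, y)
    return counts


def counts_max(cost, traversals, radius):
    """Counts for a max expansion: only the costliest neighbor is credited."""
    counts = np.zeros(cost.shape)
    for cell, d in traversals:
        y = max(neighborhood(cost.shape, cell, radius), key=lambda n: cost[n])
        counts[y] += d
    return counts
```

Note that `counts_max` depends on the current cost map, which is the source of the additional non-convexity mentioned above: a different starting cost function can credit different cells.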

4.3.3  Planners  with  Motion  Constraints 

A  common  architecture  for  mobile  robot  planning 
systems  is  to  utilize  a  hierarchy  of  planners,  rang¬ 
ing  from  long-range  low  resolution  planners  to  short- 
range  high  resolution  planners.  The  simplest  form 
of this architecture utilizes a long-range, unconstrained ’global’ planner, and
a short-range, kinematically or dynamically constrained ’local’ planner
[Kelly et al., 2006, Stentz et al., 2007]. Usually, a local planner does not
plan all the way to the goal; instead it produces a set of feasible short-range actions,
and  utilizes  the  global  planner  to  produce  a  path  from 
the  end  of  the  action  to  the  goal.  Local  planner  gen¬ 
erated  plans  are  not  actually  followed  to  completion, 
but  instead  are  replanned  at  a  high  rate.  In  this  way, 
the  local  planner  is  not  actually  optimal,  but  instead 
performs  something  akin  to  a  greedy  search. 

If LEARCH is to be used to learn cost functions for an onboard perception
system, it is important that they be learned with respect to the planning
system that will be used onboard the robot. If a hybrid architecture is used,
LEARCH must therefore be implemented using the same configuration. It is
important that the decisions the planner is tasked with making during training
are exactly those that will be required during operation. For example, in the
case of a greedy local planning system, $P_*$ at each iteration should be the
direct output of the planning system, not the concatenation of multiple
planning cycles (even though this is how the robot’s behavior is determined
online). This is necessary for credit assignment; if at time t the expert
swerved to avoid an obstacle, then not only must the cost of the obstacle be
sufficiently high, but the cost of the terrain swerved over must be
sufficiently low. Even if subsequent actions reduce the distance traveled
during the swerve, the planner must be willing to perform a large turn at time
t to begin that maneuver.

Related  to  this  issue  is  one  of  planner  fidelity.  De¬ 
pending  on  discretization  resolution  and  the  level  of 
vehicle  modeling,  it  is  unlikely  that  a  kinematically  or 
dynamically  constrained  planning  system  will  be  able 
to  exactly  recreate  the  same  behavior  as  executed 
during  demonstration.  Inability  to  recreate  demon¬ 
strated  behavior  can  introduce  noise  into  the  learned 
cost  function,  of  magnitude  proportional  to  the  de¬ 
gree  of  mismatch.  Possible  solutions  for  this  issue 
are  discussed  in  the  next  two  sections. 


5  Experimental  Results 

Learning cost functions by demonstration was applied to the Crusher autonomous system (Figure 1). The Crusher vehicle is capable of traversing rough, complex, and unstructured terrain; as such, understanding the relative benefits and tradeoffs of various terrain types is of paramount importance to its autonomous operation. On Crusher, terrain data comes from two primary sources. The first is prior, static, overhead data sources (satellite or aerial imagery, aerial LiDAR, etc.). Overhead data is processed via a set of engineered feature extractors into a set of feature maps, which are then mapped into a single, static costmap for an entire area of operation. The second source of terrain data comes from the onboard perception system. The onboard perception system processes local data from onboard camera images and LiDAR into a dynamic stream of features. At a high data rate, these features are continuously mapped to costmaps over a local area.

Figure 11: A high level block diagram of the Crusher Autonomy system

Costs from both sources are continuously fused into a single consistent costmap, which is then passed to Crusher’s motion planning system. Fusing prior and onboard perceptual data at the cost level allows Crusher to continuously plan a path all the way from its current position to the goal. Due to the dynamic nature of this cost data, the Field D* algorithm is utilized [Ferguson and Stentz, 2006]. In order to determine feasible local motion commands for Crusher, a variant of the RANGER system [Kelly, 1995, Kelly et al., 2006] is applied, utilizing the Field D* plan for global guidance. This architecture is shown in Figure 11.

Early  in  Crusher’s  development,  the  task  of  in¬ 
terpreting  both  static  and  dynamic  perceptual  data 
into  costs  was  performed  through  manual  engineer¬ 
ing,  and  was  a  large  source  of  frustration.  This  led 
to  the  application  of  learning  by  demonstration  to 
constructing  cost  functions.  This  was  first  applied  in 
the  static  perceptual  case  (overhead  data)  utilizing 
the  Field  D*  planner,  and  was  next  applied  to  the 
dynamic  case  (onboard  perceptual  data)  utilizing  the 
Field  D*  guided  RANGER  local  planner.  The  re¬ 
mainder  of  this  section  describes  these  experiments, 
along  with  offline  results  from  each  task. 

Two  metrics  are  used  for  evaluating  offline  perfor¬ 
mance.  The  first  is  the  average  loss  along  a  path.  As 
the  loss  function  is  constructed  to  encode  the  similar¬ 
ity  between  two  paths,  the  loss  between  an  example 
path  and  the  corresponding  planned  path  (over  the 
interval  [0,1])  is  a  measure  of  how  accurately  expert 
behavior  has  been  reproduced.  A  second  measure 
is  the  cost  ratio,  defined  as  the  cost  of  an  example 
path  divided  by  the  cost  of  the  corresponding  planned 
path. As this ratio approaches 1, it indicates the current cost function hypothesis is approaching consistency with the expert demonstration. The cost ratio, as opposed to the cost difference, is used to account for the effect of scaling on the cost function (with the cost difference, simply scaling the costs closer to zero would improve the metric, without improving the relative cost difference).

Figure 12: Learning simulated and expert drawn example paths: (a) Simulated Examples, (b) Expert Drawn Examples. Test set performance is shown as a function of number of input paths
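
The reason for preferring the ratio can be checked with a few lines of arithmetic. The sketch below assumes a path cost is simply the sum of per-state costs; the state names, cost values, and helper names are made up for illustration.

```python
def path_cost(path, cost):
    """Total cost of a path under a per-state cost function."""
    return sum(cost[state] for state in path)


def cost_ratio(example, planned, cost):
    """Cost of the example path divided by cost of the planned path."""
    return path_cost(example, cost) / path_cost(planned, cost)


def cost_difference(example, planned, cost):
    """Cost of the example path minus cost of the planned path."""
    return path_cost(example, cost) - path_cost(planned, cost)


# Hypothetical costs: the planner prefers 'c', the expert drove through 'b'.
cost = {"a": 1.0, "b": 4.0, "c": 2.0, "d": 1.0}
example = ["a", "b", "d"]
planned = ["a", "c", "d"]

# The same cost function with every value scaled towards zero.
scaled = {state: 0.01 * value for state, value in cost.items()}

ratio, ratio_scaled = cost_ratio(example, planned, cost), cost_ratio(example, planned, scaled)
diff, diff_scaled = cost_difference(example, planned, cost), cost_difference(example, planned, scaled)
print(ratio, ratio_scaled, diff, diff_scaled)
```

Scaling leaves the ratio at 1.5 but shrinks the difference by the scale factor, even though the planned path disagrees with the example exactly as much as before.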

5.1  Learning  to  Interpret  Overhead 
Data 

In  order  to  verify  LEARCH  under  ideal  conditions, 
tests  were  first  run  on  simulated  examples.  A  known 
(arbitrary)  cost  function  was  used  to  generate  a  cost 
map  over  a  single  environment  from  its  overhead  fea¬ 
tures;  this  cost  map  was  used  to  produce  paths  be¬ 
tween  random  waypoints.  Different  numbers  of  these 
paths  were  then  used  as  input  for  LEARCH,  and  the 
performance measured on a large independent validation set of paths (generated in the same manner).

Figure 13: Validation loss as a function of the corridor size. Using corridor constraints improves performance as long as the corridor is not too large

Figure  12(a)  shows  the  results  using  both  the 
balanced  and  standard  weighting  schemes  (Section 
4.2.1).  As  the  number  of  training  paths  is  increased, 
the  test  set  performance  continues  to  improve.  Each 
input  path  further  constrains  the  space  of  possible 
cost  functions,  bringing  the  learned  function  closer 
to  the  desired  one.  However,  there  are  diminishing 
returns  as  additional  paths  overlap  to  some  degree 
in  their  constraints.  Finally,  the  performance  of  the 
balanced  and  standard  weighting  schemes  is  similar. 
Since  all  paths  for  this  experiment  were  generated  by 
a  planner,  they  are  by  definition  optimal  under  some 
metric,  and  therefore  both  achievable  and  consistent 
with  each  other. 

Next,  experiments  were  performed  with  expert  ex¬ 
amples  (both  training  and  validation)  drawn  on  top  of 
overhead  data  maps.  Figure  12(b)  shows  the  results 
of  an  experiment  of  the  same  form  as  that  performed 
with  simulated  examples.  Again,  the  validation  set 
cost  ratio  decreases  as  the  number  of  training  exam¬ 
ples  increases.  However,  with  real  examples  there 
is  a  significant  difference  between  the  two  weight¬ 
ing  schemes;  the  balanced  weighting  scheme  achieved 
significantly  better  performance.  This  demonstrates 
both  how  human  demonstration  can  suffer  from  large 
scale  non-optimality,  and  how  LEARCH  can  be  made 
robust  to  this  fact  through  a  balanced  regression. 

Another series of experiments was performed to
determine  the  effect  of  performing  replanning  with 
corridor  constraints  (Section  4.2.2).  For  these  ex¬ 
periments,  the  performance  of  learning  was  measured 
with  validation  loss,  to  indicate  how  well  the  planned 
paths  matched  the  examples.  When  measuring  loss 
on the validation set, no replanning was performed. Therefore, in order to provide a smoother metric, the specific loss function used was a radial basis function between states on the current path $P_*$ and the closest state on the example $P_e$, with a scale parameter $\sigma^2$:

$$L(P_*, P_e) = \frac{1}{|P_*|} \sum_{x \in P_*} \left[ 1 - \exp\left( -\min_{x_i \in P_e} \left[ \|x - x_i\|^2 \right] / \sigma^2 \right) \right]$$

Using a loss function of this form provides a more analog metric than a Hamming style loss as previously
described.  Figure  13  shows  the  results  on  the  valida¬ 
tion  set  as  a  function  of  the  corridor  size  (in  cells). 
Small  corridors  provide  an  improvement  over  no  cor¬ 
ridor,  demonstrating  how  small  scale  smoothing  can 
improve  generalization.  However,  as  the  corridor  gets 
too  large,  this  improvement  disappears;  large  corri¬ 
dors  essentially  over-smooth  the  examples  and  begin 
to  miss  critical  information. 
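
A radial basis function path loss of this kind can be sketched as below. The sketch assumes two-dimensional states, squared Euclidean distance to the closest example state, and a negative exponent so that the loss lies in [0, 1]; the function and argument names are illustrative only.

```python
import math


def rbf_path_loss(planned, example, sigma):
    """Average over planned states of 1 - exp(-d^2 / sigma^2), where d is the
    distance from the planned state to the closest state on the example."""
    total = 0.0
    for px, py in planned:
        # Squared distance to the nearest example state.
        d2 = min((px - ex) ** 2 + (py - ey) ** 2 for ex, ey in example)
        total += 1.0 - math.exp(-d2 / sigma ** 2)
    return total / len(planned)


example = [(0.0, 0.0), (1.0, 0.0)]
planned = [(0.0, 0.0), (1.0, 1.0)]

identical_loss = rbf_path_loss(example, example, sigma=1.0)
loss = rbf_path_loss(planned, example, sigma=1.0)
far_loss = rbf_path_loss([(100.0, 100.0)], example, sigma=1.0)
print(identical_loss, loss, far_loss)
```

A planned path that coincides with the example scores 0 and one far from it approaches 1, with intermediate deviations graded smoothly instead of counted as hits or misses.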

Finally,  experiments  were  performed  in  order  to 
compare  the  offline  performance  of  learned  costmaps 
with engineered ones. A cost map was trained from satellite imagery for an approximately 60 km² environment. An engineered costmap had been previously produced for this same test site to support Crusher operations. This engineered map was produced by performing a supervised classification of the imagery, and then manually determining a cost for each class [Silver et al., 2006]. A subset of both maps
is  shown  in  Figure  15.  The  two  maps  were  compared 
using  a  validation  set  of  paths  generated  by  a  Crusher 
team  member  not  directly  involved  in  the  develop¬ 
ment  of  overhead  costing.  The  average  validation  loss 
using  the  LEARCH  map  was  23%  less  than  the  engi¬ 
neered  map  (Figure  14),  thus  demonstrating  superior 
generalization  of  the  learned  approach. 

Online  validation  of  the  learned  costmaps  was  also 
achieved  during  Crusher  field  tests.  These  tests  con¬ 
sisted  of  Crusher  autonomously  navigating  a  series  of 
courses,  with  each  course  defined  as  a  set  of  widely 
spaced  waypoints.  Courses  ranged  in  length  up  to 
20  km,  with  waypoint  spacing  on  the  order  of  200  to 
1000  m.  These  tests  took  place  at  numerous  locations 
across  the  continental  U.S.,  each  with  highly  varying 
local  terrain  characteristics,  and  sizes  ranging  from 
tens  to  hundreds  of  square  kilometers. 

During  2005  and  2006,  prior  maps  were  primarily 
generated as described in [Silver et al., 2006]. An initial implementation of the LEARCH algorithm was
also  field  tested  and  demonstrated  in  smaller  tests 
during  2006.  During  2007  and  2008,  LEARCH  be¬ 
came  the  default  approach  for  producing  cost  maps. 
Overall,  LEARCH  maps  were  used  during  more  than 
600  km  of  sponsor  monitored  autonomous  traverse, 
plus hundreds of kilometers more of additional field testing. This demonstrates that a learned cost function was sufficient for use online on a complex robotic system.

Figure 14: Performance comparison between a learned and engineered prior cost map. The learned map produced behavior that better matched an independent validation set of examples.

In addition, two direct online comparisons were performed. These two tests were performed several months apart, at different test sites. During each experiment, the same course was run twice, varying only the prior cost map given to the vehicle between runs. The purpose of these experiments was to demonstrate that learning a cost function not only generalized better with respect to initial route planning, but also with respect to dynamic replanning online.

Each run was scored according to the total cost incurred by the vehicle according to its onboard perception system. At the time of these experiments, the perception system made use of a manually engineered cost function. However, this function was shown through numerous experiments [Stentz et al., 2007] to result in a high level of autonomous performance; therefore it is a valid metric for scoring the safety of a single autonomous run.

Experiment                 Total Net       Avg. Speed   Total Cost   Max Cost
                           Distance (km)   (m/s)        Incurred     Incurred
Experiment 1 Learned       6.63            2.59         11108        23.6
Experiment 1 Engineered    6.49            2.38         14385        264.5
Experiment 2 Learned       6.01            2.32         17942        100.2
Experiment 2 Engineered    5.81            2.23         21220        517.9
Experiment 2 No Prior      6.19            1.65         26693        224.9

Table 1: Results of experiments comparing learned to engineered prior maps. Indicated costs are from the vehicle’s onboard perception system.

The  results  of  these  experiments  are  shown  in  Ta¬ 
ble  1.  In  both  experiments,  the  vehicle  traveled  far¬ 
ther  to  complete  the  same  course  using  learned  prior 
data,  and  yet  incurred  less  total  (online)  cost.  Over 
both  experiments,  with  each  waypoint  to  waypoint 
section  considered  an  independent  trial,  the  improve¬ 
ment  in  average  cost  and  average  speed  is  statisti¬ 
cally  significant  at  the  5%  and  10%  levels,  respec¬ 
tively.  This  indicates  that  the  terrain  the  vehicle  tra¬ 
versed  was  on  average  safer  when  using  the  learned 
prior  map,  according  to  its  own  onboard  perception 
system.  This  normalization  by  distance  traveled  is 
necessary  because  the  learned  prior  and  engineered 
perception  cost  functions  do  not  necessarily  agree  in 
what  they  consider  unit  cost.  Additionally,  the  max¬ 
imum  cost  incurred  at  any  point  along  an  experiment 
is  also  provided;  for  both  terrains,  the  maximum  is 
significantly  lower  when  using  the  learned  prior  data. 
The  course  for  Experiment  2  was  also  run  without 
any  prior  data;  the  results  are  presented  for  compar¬ 
ison. 
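
The distance normalization can be reproduced from the totals in Table 1. The sketch below assumes that average cost means total cost incurred divided by total net distance; it works on whole runs, so it does not reproduce the per-section significance tests, and the variable names are illustrative.

```python
# (total net distance in km, total cost incurred), copied from Table 1.
runs = {
    ("Experiment 1", "Learned"): (6.63, 11108),
    ("Experiment 1", "Engineered"): (6.49, 14385),
    ("Experiment 2", "Learned"): (6.01, 17942),
    ("Experiment 2", "Engineered"): (5.81, 21220),
    ("Experiment 2", "No Prior"): (6.19, 26693),
}


def cost_per_km(distance_km, total_cost):
    """Onboard perception cost incurred per kilometer traveled."""
    return total_cost / distance_km


normalized = {key: cost_per_km(*value) for key, value in runs.items()}

for (experiment, prior), value in normalized.items():
    print(f"{experiment:13s} {prior:11s} {value:8.1f} cost/km")
```

Under this assumption the learned prior gives the lowest cost per kilometer in both experiments, and the run without prior data gives the highest in Experiment 2.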

In  addition  to  improving  vehicle  performance,  us¬ 
ing  learned  cost  functions  also  reduced  the  necessary 
amount  of  human  interaction.  When  preparing  for  a 
Crusher  test  using  engineered  cost  maps,  performing 
a  supervised  classification  and  tuning  the  cost  func¬ 
tion  would  take  on  the  order  of  1-2  days.  In  contrast, 
when using learned costmaps, drawing example paths would require on the order of 1-2 hours (footnote 8). In a timed head-to-head experiment on a 2 km² test site, producing a supervised classification required 40 minutes of expert involvement, and tuning a cost function required an additional 20 minutes. In contrast, producing example paths required only 12 minutes. On this same experiment, the learned costmap had a validation loss of 0.43, compared to 0.56 for the engineered map. This demonstrates that the learned approach produced superior performance, with less human interaction time (Figure 16).

Figure 15: A 10 km² section of a Crusher test site. From top to bottom: Quickbird imagery, Learned Cost, and Engineered Cost

Figure 16: Performance comparison between 3 approaches to generating prior costmaps. LEARCH not only performs better and faster at the task of determining costs of different semantic classes, it also does a better job at interpreting raw data (for the purpose of costing) than semantic classification.

8 Neither of these timings includes the necessary effort to process raw overhead data into feature maps, as this process is a shared precursor to both approaches to costing

An  additional  test  was  performed  in  which  the 
same  training  set  of  example  paths  was  used  to  learn 
a  cost  function  only  from  the  results  of  the  supervised 
classification;  in  this  case  the  learned  map  had  a  vali¬ 
dation  loss  of  0.52.  This  demonstrates  two  important 
points.  The  first  is  that  even  when  the  problem  of 
learning  a  cost  function  was  reduced  to  solely  a  low 
dimensional  parameter  tuning  problem  (in  this  case 
5  dimensions),  the  automated  approach  was  able  to 
perform  better  than  manual  tuning,  and  with  less 
human  interaction.  The  second  point  is  that  reduc¬ 
ing  the  task  to  a  lower  dimensional  problem  (labeling 
for  the  supervised  classification)  required  additional 
interaction,  and  that  this  feature  space  compression 
resulted  in  a  loss  of  useful  information  (as  the  val¬ 
idation  loss  was  better  when  learning  from  the  full 
feature  space  as  opposed  to  the  compressed  one). 

5.2  Learning  to  Interpret  Online  Perceptual  Data

Next,  dynamic  LEARCH  was  applied  to  the  task  of 
learning  a  cost  function  for  Crusher’s  onboard  percep¬ 
tion  system  (Figure  17).  Training  data  in  the  form 
of  expert  example  behaviors  was  gathered  by  having 
Crusher’s  safety  operator  RC  the  vehicle  through  sets 
of  waypoints.  Different  training  examples  were  col¬ 
lected  over  a  period  of  months  in  varying  locations 
and weather conditions, and with 3 different operators at one time or another. During data collection, all raw sensor data was logged along with the example path. Perceptual features were then produced offline by feeding the raw sensor data through Crusher’s perception software. In this way, the base perception system and its features could be modified and improved without having to recollect new training data; the raw data is just reprocessed, and a cost function learned for the new features.

Figure 18: Results of offline experiments on logged perception data. Left: Validation Loss during learning for the balanced and standard weighting. Center: Validation Loss as a function of the replanning corridor size. Right: Validation Loss as a function of the number of regression trees.

This  set  of  examples  was  first  used  to  perform  a 
series  of  offline  experiments  to  validate  the  dynamic 
extension  of  LEARCH.  Loss  on  a  validation  set  was 
used  as  the  metric,  to  measure  how  well  planned  be¬ 
havior  recreated  example  behavior.  These  results  are 
presented  in  Figure  18.  The  left  graph  again  demon¬ 
strates  the  effectiveness  of  performing  balanced  as  op¬ 
posed  to  unbalanced  regression,  as  the  balanced  ver¬ 
sion  has  superior  validation  performance.  The  center 
graph  further  demonstrates  the  utility  of  replanning 
with  corridor  constraints.  With  a  small  corridor  size, 
the  algorithm  is  able  to  smooth  out  some  of  the  noise 
in  example  human  behavior,  and  improve  generaliza¬ 
tion.  As  the  corridor  size  increases,  the  algorithm 
begins to over-smooth, resulting in decreasing validation performance. This also demonstrates how validation data can be used to automatically determine the optimal corridor size.

An  experiment  was  also  performed  to  assess  the  ef¬ 
fectiveness  of  filtering  out  inconsistent  timesteps.  A 
single  expert  behavior  was  used  to  learn  a  cost  func¬ 
tion,  first  with  no  filtering,  and  then  with  approx¬ 
imately  10%  of  its  timesteps  automatically  filtered. 
As would be expected, the performance on the remaining 90% of the training set improved after filtering (Figure 19). However, performance on the validation set also improved slightly. This demonstrates that filtering out inconsistent timesteps not only improves performance on examples for which the filtered timesteps were inconsistent, it also improves generalization to unseen examples.

As cost evaluations in an onboard perception system must be performed in real time, the computational cost of an evaluation is an important consideration. As described in Section 4.3.1, using only linear regressors is beneficial from a computational standpoint, and feature learning can be used to increase the complexity of the cost function if necessary. Figure 18 (Right) shows validation loss as a function of the number of added features learned using simple regression trees. At first, additional features improve the validation performance; however, eventually too many features can cause overfitting.

Next,  the  collected  training  set  was  used  to  learn 
a  cost  function  to  run  onboard  Crusher.  Originally, 
Crusher  used  an  engineered  perception  cost  func¬ 
tion.  During  more  than  3  years  of  Crusher  develop¬ 
ment,  this  cost  function  was  continually  redesigned 
and retuned, culminating in a high performance system [Stentz et al., 2007]. However, this performance
came  at  a  high  cost  of  human  effort.  Version  con¬ 
trol  logs  indicate  that  145  changes  were  made  to  just 
the  structure  of  the  model  mapping  perceptual  fea¬ 
tures  to  costs;  additionally  more  than  300  parameter 
changes  were  checked  in  (untold  more  were  tried  at 
one  point  or  another,  requiring  verification  on  logged 
data  or  actual  vehicle  performance).  As  each  com¬ 
mitted  change  requires  significant  time  to  design,  im¬ 
plement,  and  validate,  easily  hundreds  of  hours  were 
spent  on  engineering  the  cost  function.  In  contrast, 
the  time  to  collect  the  final  training  set  for  learning 
by  demonstration  required  only  a  few  hours  of  hu¬ 
man  time  (spread  over  several  months).  This  seem¬ 
ingly  small  amount  of  human  demonstration  is  suf¬ 
ficient  due  to  the  numerous  constraints  implied  by 
each  example  behavior:  the  approximately  3  kilome¬ 
ters  of  demonstration  provides  hundreds  of  thousands 
of examples of states to traverse, and millions more examples of states that were avoided.

Panels: (a) Left Camera Image (b) Right Camera Image (c) Max Object Height (d) Density in Wheel Re¬

Figure 17: An example of learned perception cost from a simple scene, depicted in (a),(b). Processing of raw data results in perceptual features (a subset of which are shown in (c) - (g)) that are mapped into cost (h) by a learned function.

Figure 19: Comparison of performance with and without filtering. Filtering out inconsistent timesteps improves performance on both training examples which were inconsistent with the filtered examples, as well as on an independent validation set.

Crusher’s perception system actually consists of two modules: a local perception system that produces high resolution cost data out to 20 m, and a far range perception system that produces medium resolution cost data out to 60 m. The far range system utilizes near-to-far learning as described in [Sofman et al., 2006] for learning a far range cost
function  from  near  range  cost  data.  Therefore,  when 
the  system  learns  online  from  scratch,  the  near  range 
cost  function  implicitly  defines  the  far  range  cost 
function.  For  this  reason,  learning  by  demonstra¬ 
tion  was  only  used  to  learn  the  near  range  function; 
the  far  range  function  would  automatically  adapt  to 
whatever  near  range  function  was  utilized. 

As  Crusher  utilizes  a  hybrid  local/global  plan¬ 
ning  architecture,  it  was  necessary  to  learn  a  cost 
function  with  respect  to  the  lowest  level  of  motion 
planning  (see  Section  4.3.3);  otherwise  the  learned 
cost  function  would  not  actually  recreate  demon¬ 
strated  behaviors  when  applied  online.  However, 
the  local  planner  suffers  from  low  fidelity;  while 
the  concatenation  of  multiple  planning  cycles  can 
produce  behavior  of  surprising  complexity,  indi¬ 
vidual  planning  cycles  consider  only  a  relatively 
small subset of possible actions. This approach has proven quite effective on Crusher, as well as in previous off-road mobile systems [Singh et al., 2000, Kelly et al., 2006, Biesiadecki and Maimone, 2006, Carsten et al., 2009]. Unfortunately, as previously stated, this inability of the planner to recreate demonstrated behavior in a single planning cycle can produce noise in a learned cost function.

If  there  were  some  way  to  map  a  demonstrated  be¬ 
havior  to  the  appropriate  planner  action  at  each  cy¬ 
cle,  then  the  selection  of  that  action  could  serve  as  the 
proper  termination  condition.  Unfortunately,  collect¬ 
ing  this  information  during  demonstration  by  an  ex¬ 
pert  would  be  extremely  tedious,  requiring  an  expert 
selection  at  every  planning  cycle.  Instead,  a  heuristic 
approach to approximating this decision was implemented [Silver et al., 2009a]. Essentially, this technique seeks to project the expert’s example behavior
onto  the  space  of  possible  planner  actions.  This  was 
performed  by  first  learning  a  perception  cost  func¬ 
tion  for  Crusher’s  global  planner  (as  this  planner  is 
not  kinematically  constrained,  it  can  better  repro¬ 
duce  demonstrated  paths).  Such  a  cost  function  will 
generally  underestimate  the  cost  necessary  for  the 
equivalent kinematically constrained planner. Therefore, each potential constrained planner plan is scored by its average cost instead of total cost. A plan with low average cost cannot be said to be optimal, but it traverses desirable (low cost) terrain. An additional penalty based on path length is added to bias scores towards those that make progress towards the goal (footnote 9). After scoring each action, that with the lowest score is used as the new example. The result of this initial replanning step is to produce a new example that is feasible to the local planner.
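
The scoring step of this heuristic projection can be sketched as follows. This is a hedged illustration only: the additive form of the score, the linear length penalty, the candidate representation, and all names and values are assumptions, since the text specifies just that average cost is combined with a path length penalty and the lowest score wins.

```python
def project_onto_actions(candidates, cell_cost, length_weight):
    """Pick the feasible planner action that best matches the demonstration.

    candidates: list of (cells, path_length) pairs, where cells are the
        terrain cells an action traverses and path_length is the length of
        the resulting path (hypothetical representation).
    cell_cost: per-cell costs from the cost function learned for the
        unconstrained global planner.
    length_weight: weight of the path length penalty (assumed linear).
    """
    def score(candidate):
        cells, path_length = candidate
        average_cost = sum(cell_cost[c] for c in cells) / len(cells)
        return average_cost + length_weight * path_length

    return min(candidates, key=score)


# Hypothetical terrain costs and candidate actions.
cell_cost = {"grass": 1.0, "brush": 2.0, "rock": 5.0}
candidates = [
    (["grass", "grass", "grass"], 12.0),
    (["rock", "rock"], 8.0),
    (["grass", "brush", "grass", "brush"], 20.0),
]

new_example = project_onto_actions(candidates, cell_cost, length_weight=0.1)
print(new_example)
```

Scoring by average rather than total cost keeps a longer action over benign terrain from being penalized merely for covering more cells, while the length term discourages actions that make little progress.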

The  performance  of  different  perception  cost  func¬ 
tions  was  compared  through  more  than  150  km  of 
comparison  trials.  The  final  results  comparing  4  dif¬ 
ferent  cost  functions  are  presented  in  Table  2.  In 
addition  to  the  engineered  cost  function,  3  learned 
cost  functions  were  compared:  one  using  the  global 
planner,  one  using  the  local  planner,  and  one  using 
the  local  planner  with  the  initial  heuristic  replanning. 
Unfortunately, as the cost function itself is the variable being tested, cost cannot be used as a metric for comparing the relative safety of each system’s behavior. Therefore, various proprioceptive statistics were recorded to offer an approximate quantitative description of safety. The number of safety related e-stops was also recorded. The statistics for each learned system were then compared on a waypoint by waypoint basis to the engineered system, to test for statistical significance. While it is not possible to convert these numbers into a single quantitative metric for comparison (footnote 10), it is possible to make a broad relative comparison between systems by observing individual statistics.

9 The weight of this penalty can be automatically tuned by optimizing performance on a validation set, preventing any hand tuning

10 Doing so would require a manually tuned and essentially arbitrary weighting, the very problem this work seeks a solution to

In  comparison  to  the  engineered  system,  the  cost 
function  learned  for  the  global  planner  resulted  in 
overly  aggressive  performance.  As  discussed  in  Sec¬ 
tion  4.3.3,  learning  in  this  manner  does  not  result  in 
sufficiently  high  costs  for  Crusher  to  avoid  obstacles 
online.  The  empirical  result  is  that  Crusher  drives 
faster,  turns  less,  and  backs  up  less,  while  suffering 
from  increased  mobility  risk  (in  the  form  of  twice  the 
rate of safety e-stops). In contrast, the cost function learned for the local planner resulted in performance very similar to that of the already high performance engineered system; the only significant difference was that the learned system turned slightly less (as indicated by lower average angular velocity and lateral speed/acceleration).

Adding  an  initial  heuristic  replanning  step  to  learn¬ 
ing  a  cost  function  for  the  local  planner  resulted 
in  a  system  that  also  maintained  a  seemingly  equal 
level  of  safety  to  the  engineered  system;  the  differ¬ 
ence  in  safety  e-stops  was  not  statistically  significant, 
but  there  was  a  significant  decrease  in  the  wear  on 
Crusher  in  the  form  of  lower  motor  current  draw  and 
less  suspension  travel.  However,  this  equal  safety  was 
achieved  with  seemingly  more  aggressiveness  than  the 
engineered  cost  function.  This  is  indicated  by  a  sta¬ 
tistically  significant  decrease  not  only  in  angular  ve¬ 
locity  and  lateral  movement,  but  also  in  the  amount 
of  backing  up,  with  a  significant  increase  in  average 
speed.  The  effect  of  the  initial  replanning  stage  is 
to  reduce  noise  in  the  cost  function;  the  reduction  of 
this  noise  allows  the  vehicle  to  alter  its  behavior  less 
in  the  face  of  false  positive  high  cost  regions,  while 
still  avoiding  true  obstacles.  The  overall  result  is  a 
slight  performance  improvement,  achieved  with  or¬ 
ders  of  magnitude  less  human  effort. 

6  Conclusion 

This  paper  addresses  the  task  of  interpreting  per¬ 
ceptual  data  for  use  in  autonomous  navigation.  We 
have  shown  the  applicability  of  learning  from  ex¬ 
pert  demonstration  towards  improving  the  robustness 
of  autonomous  navigation  systems,  while  helping  to 
minimize  the  necessary  amount  of  expert  interaction. 
Specifically,  the  parameter  tuning  problem  that  often 
results  from  the  coupling  of  complex  perception  and 
planning  systems  can  be  automated  through  expert 
demonstration  instead  of  expert  intervention.  This 
automated  approach  not  only  reduces  development 
and  validation  time,  but  can  produce  a  more  efficient 
and  robust  system  than  manual  approaches. 

System            Avg Dist    Avg Cmd     Avg Cmd     Avg Lat     Dir Switch   Avg Motor
                  Made Good   Vel (m/s)   Ang Vel     Vel (m/s)   Per m        Current (A)
                  (m)                     (°/s)
Engineered        130.7       3.24        6.56        0.181       0.107        7.53
Global            123.8*      3.34*       4.96*       0.170*      0.081*       7.11*
Local             127.3       3.28        5.93*       0.172*      0.100        7.35
Local w/replan    124.3*      3.39*       5.08*       0.170*      0.082*       7.02*

System            Avg         Avg         Avg Vert    Avg Lat     Susp         Safety
                  Roll (°)    Pitch (°)   Accel       Accel       Max Δ (m)    E-stops
                                          (m/s²)      (m/s²)
Engineered        4.06        2.21        0.696       0.997       0.239        0.027
Global            4.02        2.22        0.710*      0.966*      0.237        0.054*
Local             4.06        2.22        0.699       0.969*      0.237        0.034
Local w/replan    3.90*       2.18        0.706*      0.966*      0.234*       0.030

Table 2: Averages over 295 different waypoint to waypoint trials per perception system, totaling over 150 km of traverse. Statistically significant differences (from Engineered) denoted by *

Most importantly, this approach provides a formalism for how a cost function should be defined; it is the
simplest  function  that  meets  constraints  implied  by 
example  behavior.  As  opposed  to  a  hand  tuned  ap¬ 
proach  (which  can  easily  overfit)  this  simplicity  pro¬ 
vides  a  theoretical  basis  for  achieving  robustness  and 
generalization.  Further,  this  approach  naturally  re¬ 
sults  in  an  automated  framework  for  both  initial  and 
continuing  cost  function  validation. 

Future work will explore the application of this approach to
learning cost functions over full state-action pairs. The
behavior of an autonomous navigation system is defined not only
by which terrain it prefers, but also by which motions it
prefers (e.g., minimum curvature). Improper modeling of these
preferences is just as significant a detriment to robot
performance as improper modeling of terrain preferences; it
also contributes to the inability of some planning systems to
properly recreate demonstrated behavior. Previous work
[Abbeel et al., 2008] has briefly investigated this challenge
independently of the equivalent perception problem. However, by
learning all preferences at once from human demonstration, we
hope to further improve the robustness of autonomous
navigation, while further reducing the effort involved in
deployment and validation of such systems.